Abstract

Information relaxation and duality in Markov decision processes have been studied recently by several researchers
with the goal of deriving dual bounds on the value function. In this paper we extend this dual formulation to controlled
Markov diffusions: in a similar way, we relax the constraint that the decision should be made based on the current
information and impose a penalty to punish access to future information. We establish the weak duality,
strong duality and complementary slackness results in parallel with those in Markov decision processes. We
explore the structure of the optimal penalties and expose the connection between Markov decision processes and
controlled Markov diffusions. We demonstrate the use of the dual representation for controlled Markov diffusions in
a classic dynamic portfolio choice problem. We evaluate the lower bounds on the expected utility by Monte Carlo
simulation under a sub-optimal policy, and we propose a new class of penalties to derive upper bounds with little
extra computation. The small gaps between the lower bounds and upper bounds indicate that the available policy is
near optimal and that our proposed penalty is effective in the dual method.

I. Introduction 

Markov decision processes (MDPs) and controlled Markov diffusions play a central role in modeling
discrete-time and continuous-time dynamic decision making problems under uncertainty, respectively, and hence have wide
applications in diverse fields such as engineering, operations research and economics. However, the standard
approach of solving for optimal policies via dynamic programming and the Hamilton-Jacobi-Bellman (HJB) equation
suffers from the "curse of dimensionality": the size of the state space increases exponentially with the dimension
of the state. Many modern approximate dynamic programming methods have been proposed for solving MDPs
in recent years to combat this curse of dimensionality, such as [1], [2], [3], [4]. These methods often generate
sub-optimal policies, simulation under which leads to lower bounds (or upper bounds) on the optimal expected
reward (or cost). Though the accuracy of the sub-optimal policies is generally unknown, the lack of performance
guarantee on sub-optimal policies can potentially be addressed by providing a dual bound, i.e., an upper bound (or
lower bound) on the optimal expected reward (or cost). Valid and tight dual bounds based on a dual representation
of MDPs were recently developed by [5] and [6]. The main idea of this duality approach is to allow the decision
maker to foresee the future uncertainty but impose a penalty for getting access to the information in advance. In
addition, this duality approach only involves pathwise deterministic optimization problems and is therefore
well-suited to Monte Carlo simulation, making it useful for evaluating the quality of sub-optimal policies in complex
dynamic systems.

This dual formulation of MDPs is attractive in both theoretical and practical aspects. On one hand, the idea of
relaxing the constraint on the non-anticipative policies in the setting of MDPs dates back at least to [7], as exposed
by [8]. In addition, the optimal penalty is not unique: for general problems we have the value function-based penalty
developed by [5] and [6]; for problems with convex structure there is an alternative optimal penalty, namely the
gradient-based penalty, as pointed out by [9]. On the other hand, in order to derive tight dual bounds, various
approximation schemes based on different optimal penalties have been proposed, including [6], [9], [10], [11]. We
note that the practical computation of dual bounds based on this dual framework has only just begun,
and it has found increasing applications in different fields, such as [12], [9], [13], [14], [15].

The goal of this paper is to extend the information relaxation approach and the dual representation of MDPs to 
controlled Markov diffusions. The motivation is that the HJB equation rarely allows a closed-form value function, 
especially when the dimension of the state space is high or there are constraints on the control space. Many 
numerical methods have been developed based on different approximation schemes: [16] considered the Markov 
chain approximation method by discretizing the HJB equation; [17] extended the approximate linear programming 
method to controlled Markov diffusions. Another standard numerical approach is to discretize the time space, 
which reduces the original continuous-time problem to an MDP and hence the techniques of approximate dynamic 
programming can be applied. Since the quality of a numerical solution is hard to justify in many problems, we are 
interested in deriving a tight dual bound on the value function of a controlled Markov diffusion by formulating its 
dual representation. Some central questions on this topic are:

• Can we establish a similar framework of dual formulation for controlled Markov diffusions based on information 
relaxation as that for MDPs? 

• If the answer is yes, what is the form of the optimal penalty in the setting of controlled Markov diffusions? 
Is the optimal penalty unique? 

• If an optimal penalty exists, does it help to facilitate the computation of the dual bound on the value
function?

The answer to the first question is yes, at least for a wide class of controlled Markov diffusions. To fully answer
all the questions we need the technical machinery of "anticipating stochastic calculus" (see, e.g., [18], [19]).
To ease reading, we first present the information relaxation-based dual formulation of controlled Markov diffusions
without using the heavy machinery. We establish the weak duality, strong duality and complementary slackness
results in parallel with those in the dual formulation of MDPs. The complete answer to the second question
is postponed to Appendix D, where we develop all the needed technical machinery and investigate one type
of optimal penalty, the so-called "value function-based penalty". Then we emphasize the computational
aspect, using the result of this dual representation to answer the third question. One key feature of the value
function-based optimal penalty is that it can be written compactly as an Ito stochastic integral under the natural
filtration generated by the Brownian motions. This compact expression potentially enables us to design sub-optimal
penalties in simple forms and also facilitates the computation of the dual bound. A direct application is illustrated
by a classic dynamic portfolio choice problem with predictable returns and intermediate consumptions: we consider
the numerical solution to a discrete-time model that is discretized from a continuous-time model; an effective class
of penalties that are easy to evaluate is proposed to derive dual bounds on the value function for the discrete-time
model.

It turns out that [20], [21], [22] pioneered a series of related works for controlled Markov diffusions. They
also adopted the approach of relaxing the future information and penalizing, much earlier than the dual
framework of MDPs was established. In particular, [20] proposed a Lagrangian approach for penalization, where the
Lagrangian term plays essentially the same role as a penalty in our dual framework; in addition, we find that the
Lagrangian term has a similar flavor to the gradient-based penalty proposed by [9] for MDPs. The early work of
[20] is not widely known, perhaps due to its technical complexity. The main difference between their work
and ours is that we propose a more general framework that may incorporate their Lagrangian approach as a special
case; the optimal penalty we develop in this paper is value function-based, while their Lagrangian approach behaves
like a gradient-based penalty. In addition, their work is purely theoretical and does not suggest any computational
method. In contrast, we provide a numerical example to demonstrate the practical use of our dual formulation. We
summarize our contributions as follows:

• We establish a dual representation of controlled Markov diffusions based on information relaxation. We also 
explore the structure of the optimal penalty and expose the connection between MDPs and controlled Markov 
diffusions. 

• Based on the result of the dual representation of controlled Markov diffusions, we demonstrate its practical
use in a dynamic portfolio choice problem. In many cases the numerical results of the upper bounds on the
expected utility show that our proposed penalties are near optimal, compared with the lower bounds induced
by sub-optimal policies for the same problem.

The rest of the paper is organized as follows. In Section II, we review the dual formulation of MDPs and derive 
the dual formulation of controlled Markov diffusions. In Section III, we illustrate the dual approach and carry out 
numerical studies in a dynamic portfolio choice problem. Finally, we conclude with future directions in Section IV. 
We defer most of the proofs, and the discussion of the connection between [20], [9] and our work, to the Appendix.

II. Controlled Markov Diffusions and Its Dual Representation 

We begin in Section II-A with a brief review of the dual framework on Markov decision processes that was first
developed by [5] and [6]. We then give the basic setup of the controlled Markov diffusion and its associated
Hamilton-Jacobi-Bellman equation in Section II-B. We develop the dual representation of controlled Markov diffusions
and present the main results in Section II-C.



A. Review of Dual Formulation of Markov Decision Processes 

Consider a finite-horizon MDP on the probability space $(\Omega, \mathcal{G}, \mathbb{P})$. Time is indexed by $\mathcal{K} = \{0, 1, \dots, K\}$. Suppose
$\mathcal{X}$ is the state space and $\mathcal{A}$ is the control space. The state $\{x_k\}$ follows the equation

$$x_{k+1} = f(x_k, a_k, v_{k+1}), \quad k = 0, 1, \dots, K-1, \tag{1}$$

where $a_k \in \mathcal{A}$ is the control whose value is decided at time $k$, and $v_k$ is a random variable taking values in the
set $\mathcal{V}$ with a known distribution. The evolution of the information is described by the filtration $\mathbb{G} = \{\mathcal{G}_0, \dots, \mathcal{G}_K\}$
with $\mathcal{G} = \mathcal{G}_K$. In particular, each $v_k$ is $\mathcal{G}_k$-adapted.

Denote by $\mathbb{A}$ the set of all control strategies $\mathbf{a} = (a_0, \dots, a_{K-1})$, i.e., each $a_k$ takes value in $\mathcal{A}$. Let $\mathbb{A}_{\mathbb{G}}$ be
the set of control strategies that are adapted to the filtration $\mathbb{G}$, i.e., each $a_k$ is $\mathcal{G}_k$-adapted. We also call $\mathbf{a} \in \mathbb{A}_{\mathbb{G}}$
a non-anticipative policy. Given an $x_0 \in \mathcal{X}$, the objective is to maximize the expected reward by selecting a
non-anticipative policy $\mathbf{a} \in \mathbb{A}_{\mathbb{G}}$:



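$$V_0(x_0) = \sup_{\mathbf{a} \in \mathbb{A}_{\mathbb{G}}} J_0(x_0; \mathbf{a}), \quad \text{where } J_0(x_0; \mathbf{a}) = \mathbb{E}\Big[\sum_{k=0}^{K-1} g_k(x_k, a_k) + \Lambda(x_K) \,\Big|\, x_0\Big]. \tag{2}$$

The expectation in (2) is taken with respect to the random sequence $\mathbf{v} = (v_1, \dots, v_K)$. The value function $V_0$ is a
solution to the following dynamic programming recursion:

$$V_K(x_K) = \Lambda(x_K);$$

$$V_k(x_k) = \sup_{a_k \in \mathcal{A}} \big\{ g_k(x_k, a_k) + \mathbb{E}[V_{k+1}(x_{k+1}) \mid x_k, a_k] \big\}, \quad k = K-1, \dots, 0.$$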

Next we describe the dual formulation of the value function $V_0(x_0)$. Here we only consider the perfect information
relaxation, i.e., we have full knowledge of the future randomness, since this relaxation is usually more applicable
in practice.

Define $\mathbb{E}_{0,x}[\cdot] = \mathbb{E}[\cdot \mid x_0 = x]$. Let $\mathcal{M}_{\mathbb{G}}(0)$ denote the set of dual feasible penalties $M(\mathbf{a}, \mathbf{v})$, which do not penalize
non-anticipative policies in expectation, i.e.,

$$\mathbb{E}_{0,x}[M(\mathbf{a}, \mathbf{v})] \le 0 \quad \text{for all } x \in \mathcal{X} \text{ and } \mathbf{a} \in \mathbb{A}_{\mathbb{G}}.$$

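Denote by $\mathcal{D}$ the set of real-valued functions on $\mathcal{X}$. Then we define an operator $\mathcal{L}: \mathcal{M}_{\mathbb{G}}(0) \to \mathcal{D}$:

$$(\mathcal{L}M)(x) = \mathbb{E}_{0,x}\Big[\sup_{\mathbf{a} \in \mathbb{A}} \Big\{ \sum_{k=0}^{K-1} g_k(x_k, a_k) + \Lambda(x_K) - M(\mathbf{a}, \mathbf{v}) \Big\}\Big]. \tag{3}$$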

Note that the supremum in (3) is over the set $\mathbb{A}$, not the set $\mathbb{A}_{\mathbb{G}}$, i.e., the control $a_k$ is chosen based on the exposed future
information. The optimization problem inside the expectation in (3) is usually referred to as the inner optimization
problem. In particular, the right hand side of (3) is well-suited to Monte Carlo simulation: we can simulate a
realization of $\mathbf{v} = \{v_1, \dots, v_K\}$ and solve the inner optimization problem:

$$I(x, M, \mathbf{v}) = \max_{\mathbf{a}} \ \sum_{k=0}^{K-1} g_k(x_k, a_k) + \Lambda(x_K) - M(\mathbf{a}, \mathbf{v}) \tag{4a}$$

$$\text{s.t.} \quad x_0 = x, \quad x_{k+1} = f(x_k, a_k, v_{k+1}), \quad k = 0, \dots, K-1, \tag{4b}$$

$$a_k \in \mathcal{A}, \quad k = 0, \dots, K-1, \tag{4c}$$

which is in fact a deterministic dynamic program. The optimal value $I(x, M, \mathbf{v})$ is an unbiased estimator of $(\mathcal{L}M)(x)$.
Theorem 1 below establishes a strong duality in the sense that for all $x_0 \in \mathcal{X}$,

$$\sup_{\mathbf{a} \in \mathbb{A}_{\mathbb{G}}} J_0(x_0; \mathbf{a}) = \inf_{M \in \mathcal{M}_{\mathbb{G}}(0)} (\mathcal{L}M)(x_0).$$

In particular, Theorem 1(a) suggests that $(\mathcal{L}M)(x_0)$ can be used to derive an upper bound on the value function
$V_0(x_0)$ given any $M \in \mathcal{M}_{\mathbb{G}}(0)$, i.e., $I(x_0, M, \mathbf{v})$ is a high-biased estimator of $V_0(x_0)$ for all $x_0 \in \mathcal{X}$; Theorem 1(b)
states that the duality gap vanishes if the dual problem is solved by choosing $M$ in the form of (5).

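Theorem 1 (Theorem 2.1 in [6])

(a) (Weak Duality) For all $M \in \mathcal{M}_{\mathbb{G}}(0)$ and all $x \in \mathcal{X}$, $V_0(x) \le (\mathcal{L}M)(x)$.

(b) (Strong Duality) For all $x \in \mathcal{X}$, $V_0(x) = (\mathcal{L}M^*)(x)$, where

$$M^*(\mathbf{a}, \mathbf{v}) = \sum_{k=0}^{K-1} \big( V_{k+1}(x_{k+1}) - \mathbb{E}[V_{k+1}(x_{k+1}) \mid x_k, a_k] \big). \tag{5}$$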

Remark 1

1) Note that the right hand side of (5) is a function of $(\mathbf{a}, \mathbf{v})$, since $\{x_k\}$ depends on $(\mathbf{a}, \mathbf{v})$ through the state
equation (1).

2) The reason that $M \in \mathcal{M}_{\mathbb{G}}(0)$ is called a (dual feasible) penalty function becomes clear: if the relaxation of
the requirement on the non-anticipative policies is penalized by using a proper function in $\mathcal{M}_{\mathbb{G}}(0)$, then the
value function $V_0$ can be recovered via the dual approach due to the strong duality result.

3) Note that the optimal penalty $M^*(\mathbf{a}, \mathbf{v})$ is the sum of a $\mathbb{G}$-martingale difference sequence when $\mathbf{a} \in \mathbb{A}_{\mathbb{G}}$;
therefore, $M^*(\mathbf{a}, \mathbf{v}) \in \mathcal{M}_{\mathbb{G}}(0)$. Since $M^*$ depends on the value function $\{V_k\}$, it is referred to as the value
function-based penalty.
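The weak and strong duality statements of Theorem 1 can be checked by brute force on a very small problem. The sketch below is only an illustration: the three-state MDP, its rewards and its noise distribution are hypothetical choices made here, not an example from the literature cited above. It computes the value function by the dynamic programming recursion, solves the inner problem (4) by enumerating all action sequences for every noise path, and evaluates the dual bound once with zero penalty and once with the value function-based penalty (5).

```python
import itertools
import numpy as np

# Hypothetical toy MDP; every number below is an illustrative assumption.
K = 3                       # horizon
STATES = [0, 1, 2]
ACTIONS = [0, 1]
NOISE = [0, 1]              # support of each v_k
PROB = [0.5, 0.5]           # known distribution of v_k

def f(x, a, v):
    """State equation x_{k+1} = f(x_k, a_k, v_{k+1})."""
    return (x + a + v) % 3

def g(x, a):
    """Stage reward g_k(x_k, a_k) (time-homogeneous here)."""
    return float(x) - 0.4 * a

def terminal(x):
    """Terminal reward Lambda(x_K)."""
    return 2.0 if x == 0 else 0.0

def backward_induction():
    """Return V[k, x] and C[k, x, a] = E[V_{k+1}(x_{k+1}) | x_k = x, a_k = a]."""
    V = np.zeros((K + 1, len(STATES)))
    C = np.zeros((K, len(STATES), len(ACTIONS)))
    V[K] = [terminal(x) for x in STATES]
    for k in range(K - 1, -1, -1):
        for x in STATES:
            for a in ACTIONS:
                C[k, x, a] = sum(p * V[k + 1, f(x, a, v)] for v, p in zip(NOISE, PROB))
            V[k, x] = max(g(x, a) + C[k, x, a] for a in ACTIONS)
    return V, C

def inner_value(x0, v_path, V=None, C=None):
    """Solve the inner problem (4) for one noise path by enumeration.

    With V and C supplied, the value function-based penalty (5) is subtracted;
    otherwise the penalty is zero.
    """
    best = -np.inf
    for a_seq in itertools.product(ACTIONS, repeat=K):
        x, total = x0, 0.0
        for k in range(K):
            x_next = f(x, a_seq[k], v_path[k])
            total += g(x, a_seq[k])
            if V is not None:
                total -= V[k + 1, x_next] - C[k, x, a_seq[k]]
            x = x_next
        best = max(best, total + terminal(x))
    return best

def dual_bound(x0, V=None, C=None):
    """Exact expectation of the inner optimal value over all noise paths."""
    total = 0.0
    for idx in itertools.product(range(len(NOISE)), repeat=K):
        weight = np.prod([PROB[i] for i in idx])
        total += weight * inner_value(x0, [NOISE[i] for i in idx], V, C)
    return total

V, C = backward_induction()
for x0 in STATES:
    print(x0, V[0, x0], dual_bound(x0), dual_bound(x0, V, C))
```

Because the expectation is computed exactly rather than by sampling, the zero-penalty bound is never below the value function, and the bound with penalty (5) matches the value function up to floating-point error.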

The optimal penalty (5) that achieves the strong duality involves the value function $\{V_k\}$, and hence is intractable
in practical problems. In order to obtain tight dual bounds, a natural idea is to derive sub-optimal penalty functions
based on a good approximate value function $\{\hat{V}_k\}$ or some sub-optimal policy $\hat{\mathbf{a}}$. Methods based on these ideas
have been successfully implemented in American option pricing problems by [23], [24], [25], and also in [6],
[12], [13]. However, these approaches cannot be extended immediately from the American option pricing problem
to general continuous-state MDPs. The first difficulty, as pointed out by [8], is that $\mathbb{E}[V_{k+1}(x_{k+1}) \mid x_k, a_k]$
usually cannot be written as an analytic function of $x_k$ and $a_k$. Though this conditional expectation may be evaluated
approximately after discretizing the state space, this could be time-consuming when $f$ is of high dimension. Second,
even if the penalty $V_{k+1}(x_{k+1}) - \mathbb{E}[V_{k+1}(x_{k+1}) \mid x_k, a_k]$ can be computed analytically, the inner optimization problem
(4) may still be difficult to solve since no convex structure can be guaranteed (even assuming that (4) is convex with
$M = 0$). To overcome these difficulties, [9] introduced the gradient-based penalty in the context of dynamic portfolio
optimization with transaction costs, and [10] derived a penalty by employing a parameterized quadratic function
to approximate the value function in linear systems with convex costs. [11] proposed parameterizing
the penalties directly to avoid starting with approximate value functions. Furthermore, [10] explored the connection
between the information relaxation duality method and the approximate linear programming approach proposed by [4],
and [8] revealed that in linear-quadratic problems the value function-based penalty and gradient-based penalty are
both optimal, but in different senses.

B. Controlled Markov Diffusions and Hamilton-Jacobi-Bellman Equation

This subsection is concerned with the control of Markov diffusion processes. Applying Bellman's principle
of dynamic programming leads to a second-order nonlinear partial differential equation, which is referred to as the
Hamilton-Jacobi-Bellman equation. For a comprehensive treatment of this topic we refer the reader to [26].

Let us consider an $\mathbb{R}^n$-valued controlled Markov diffusion process $(x_t)_{0 \le t \le T}$ driven by an $m$-dimensional Brownian
motion $(w_t)_{0 \le t \le T}$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, following the stochastic differential equation (SDE):

$$dx_t = b(t, x_t, u_t)\,dt + \sigma(t, x_t, u_t)\,dw_t, \quad 0 \le t \le T, \tag{6}$$

where $u_t \in \mathcal{U} \subset \mathbb{R}^{d_u}$ is the control applied at time $t$, and $b$ and $\sigma$ are functions $b: [0, T] \times \mathbb{R}^n \times \mathcal{U} \to \mathbb{R}^n$ and
$\sigma: [0, T] \times \mathbb{R}^n \times \mathcal{U} \to \mathbb{R}^{n \times m}$. The natural (augmented) filtration generated by the Brownian motions is denoted by
$\mathbb{F} = \{\mathcal{F}_t, 0 \le t \le T\}$ with $\mathcal{F} = \mathcal{F}_T$. In the following $\|\cdot\|$ denotes the Euclidean norm.

Definition 1 A control strategy $\mathbf{u} = (u_s)_{s \in [t,T]}$ is called an admissible strategy at time $t$ if

1) $\mathbf{u} = (u_s)_{s \in [t,T]}$ is an $\mathbb{F}$-progressively measurable process taking values in $\mathcal{U}$ (i.e., $\mathbf{u}$ is a non-anticipative
policy), and satisfying $\mathbb{E}[\int_t^T \|u_s\|^2\,ds] < \infty$;

2) $\mathbb{E}_{t,x}[\sup_{s \in [t,T]} \|x_s\|^2] < \infty$, where $\mathbb{E}_{t,x}[\cdot] = \mathbb{E}[\cdot \mid x_t = x]$.

The set of admissible strategies at time $t$ is denoted by $\mathcal{U}_{\mathbb{F}}(t)$. With the standard technical conditions imposed on
$b$ and $\sigma$ (specified in Appendix A), the SDE (6) admits a unique pathwise solution when $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$.

Let $Q = [0, T) \times \mathbb{R}^n$ and $\bar{Q} = [0, T] \times \mathbb{R}^n$. We define the functions $\Lambda: \mathbb{R}^n \to \mathbb{R}$ and $g: \bar{Q} \times \mathcal{U} \to \mathbb{R}$ as the final
reward and intermediate reward, respectively. Then we introduce the reward functional

$$J(t, x; \mathbf{u}) = \mathbb{E}_{t,x}\Big[\Lambda(x_T) + \int_t^T g(s, x_s, u_s)\,ds\Big].$$

The final reward $\Lambda$ and the intermediate reward $g$ satisfy the polynomial growth conditions that are specified in
Appendix A. Given an initial condition $(t, x) \in Q$, the objective is to maximize $J(t, x; \mathbf{u})$ over all controls $\mathbf{u}$ in
$\mathcal{U}_{\mathbb{F}}(t)$:

$$V(t, x) = \sup_{\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(t)} J(t, x; \mathbf{u}). \tag{7}$$

Here we abuse the notation of the state $x$, the rewards $\Lambda$ and $g$, and the value function $V$, since they play the same
roles as those in MDPs.
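For a fixed non-anticipative feedback policy, the reward functional $J(0, x; \mathbf{u})$ can be estimated by simulating the SDE (6) on a time grid, which gives a lower bound on $V(0, x)$ up to discretization and sampling error. The following is a minimal sketch for a scalar state, assuming an Euler-Maruyama discretization; the function name `simulate_reward` and the test problem (drift $u$, unit diffusion, $g = -u^2$, $\Lambda(x) = -x^2$, whose value function $-x^2/(1+T-t) - \ln(1+T-t)$ follows from a quadratic ansatz in the HJB equation) are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def simulate_reward(x0, policy, b, sigma, g, terminal, T=1.0, n_steps=200,
                    n_paths=40000, seed=0):
    """Monte Carlo estimate of J(0, x0; u) for a scalar controlled diffusion.

    policy(t, x) is a feedback rule giving u_t = policy(t, x_t). The SDE
    dx = b dt + sigma dw is discretized with the Euler-Maruyama scheme.
    Returns (sample mean, standard error).
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, float(x0))
    running = np.zeros(n_paths)            # accumulates the integral of g
    for k in range(n_steps):
        t = k * dt
        u = policy(t, x)
        running += g(t, x, u) * dt
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x = x + b(t, x, u) * dt + sigma(t, x, u) * dw
    total = running + terminal(x)
    return total.mean(), total.std(ddof=1) / np.sqrt(n_paths)

# Assumed test problem: dx = u dt + dw, g = -u^2, Lambda(x) = -x^2.
T = 1.0
est, se = simulate_reward(
    x0=1.0,
    policy=lambda t, x: -x / (1.0 + T - t),     # candidate feedback rule
    b=lambda t, x, u: u,
    sigma=lambda t, x, u: np.ones_like(x),
    g=lambda t, x, u: -u**2,
    terminal=lambda x: -x**2,
    T=T,
)
closed_form = -1.0 / (1.0 + T) - np.log(1.0 + T)  # V(0, 1) for this problem
print(est, se, closed_form)
```

In this test problem the chosen feedback rule is the optimal one, so the estimate should agree with the closed-form value up to Monte Carlo and discretization error; for a genuinely sub-optimal policy the same routine returns a strict lower bound in expectation.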

Let $C^{1,2}(Q)$ denote the space of functions $L(t, x): Q \to \mathbb{R}$ that are $C^1$ in $t$ and $C^2$ in $x$ on $Q$. For $L \in C^{1,2}(Q)$, define
a partial differential operator $A^u$ by

$$A^u L(t, x) = L_t(t, x) + L_x(t, x)^\top b(t, x, u) + \tfrac{1}{2}\,\mathrm{tr}\big(L_{xx}(t, x)(\sigma\sigma^\top)(t, x, u)\big),$$

where $L_t$, $L_x$, and $L_{xx}$ denote the $t$-partial derivative, the gradient and the Hessian with respect to $x$, respectively,
and $(\sigma\sigma^\top)(t, x, u) = \sigma(t, x, u)\sigma(t, x, u)^\top$. Let $C_p(\bar{Q})$ denote the set of functions $L(t, x): \bar{Q} \to \mathbb{R}$ that are continuous on
$\bar{Q}$ and satisfy a polynomial growth condition in $x$, i.e.,

$$|L(t, x)| \le C_L (1 + \|x\|^{c_L})$$

for some constants $C_L$ and $c_L$. The following well-known verification theorem provides a sufficient condition for
the value function and an optimal control strategy using Bellman's principle of dynamic programming.

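Theorem 2 (Verification Theorem, Theorem 4.3.1 in [26]) Suppose that $V \in C^{1,2}(Q) \cap C_p(\bar{Q})$ satisfies

$$\sup_{u \in \mathcal{U}} \{g(t, x, u) + A^u V(t, x)\} = 0, \quad (t, x) \in Q, \tag{8}$$

and $V(T, x) = \Lambda(x)$. Then

(a) $J(t, x; \mathbf{u}) \le V(t, x)$ for any $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(t)$ and any $(t, x) \in Q$.

(b) If there exists a function $u^*: Q \to \mathcal{U}$ such that

$$g(t, x, u^*(t, x)) + A^{u^*(t,x)} V(t, x) = \max_{u \in \mathcal{U}} \{g(t, x, u) + A^u V(t, x)\} = 0 \tag{9}$$

for all $(t, x) \in Q$, and if the control strategy defined as $\mathbf{u}^* = (u^*_t)_{t \in [0,T]}$ with $u^*_t = u^*(t, x_t)$ is admissible at time 0
(i.e., $\mathbf{u}^* \in \mathcal{U}_{\mathbb{F}}(0)$), then

1) $V(t, x) = \sup_{\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(t)} J(t, x; \mathbf{u})$ for all $(t, x) \in Q$.

2) $\mathbf{u}^*$ is an optimal control strategy, i.e., $V(0, x) = J(0, x; \mathbf{u}^*)$.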

Equation (8) is the well-known HJB equation associated with the stochastic optimal control problem (6)-(7).
However, the existence of $V \in C^{1,2}(Q)$ in Theorem 2 requires many technical assumptions that might not hold
in practice. For example, the HJB equation is usually assumed to be uniformly parabolic, i.e., there exists
$c_\sigma > 0$ such that for all $(t, x, u) \in Q \times \mathcal{U}$ and $\xi \in \mathbb{R}^n$,

$$\xi^\top (\sigma\sigma^\top)(t, x, u)\,\xi \ge c_\sigma \|\xi\|^2.$$

Otherwise, a classical solution $V \in C^{1,2}(Q)$ may not be expected and we need to interpret the value function as a
viscosity solution to the HJB equation (see, e.g., [26]).
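As a worked illustration of how the verification theorem is used, consider a scalar linear-quadratic problem. This example is an assumption introduced here for illustration and is not part of the original development: take $n = m = 1$, $\mathcal{U} = \mathbb{R}$, $b(t,x,u) = u$, a constant $\sigma > 0$, $g(t,x,u) = -u^2$ and $\Lambda(x) = -x^2$. A quadratic ansatz reduces the HJB equation (8) to two ordinary differential equations:

```latex
\begin{aligned}
&\text{HJB equation:} && \sup_{u \in \mathbb{R}} \Big\{ -u^2 + V_t + u V_x + \tfrac{1}{2}\sigma^2 V_{xx} \Big\} = 0,
  \qquad V(T,x) = -x^2. \\
&\text{Maximizer:} && u^*(t,x) = \tfrac{1}{2} V_x(t,x)
  \;\Longrightarrow\; V_t + \tfrac{1}{4} V_x^2 + \tfrac{1}{2}\sigma^2 V_{xx} = 0. \\
&\text{Ansatz } V = -p(t)x^2 - q(t): && \big(-p'(t) + p(t)^2\big)x^2 - q'(t) - \sigma^2 p(t) = 0 \\
& && \Longrightarrow\; p' = p^2,\quad q' = -\sigma^2 p,\quad p(T) = 1,\quad q(T) = 0. \\
&\text{Solution:} && p(t) = \frac{1}{1+T-t}, \qquad q(t) = \sigma^2 \ln(1+T-t), \\
& && V(t,x) = -\frac{x^2}{1+T-t} - \sigma^2 \ln(1+T-t), \qquad u^*(t,x) = -\frac{x}{1+T-t}.
\end{aligned}
```

Here $V$ is smooth with quadratic growth and $\sigma^2 > 0$ gives uniform parabolicity, so under these assumptions the conditions of Theorem 2 can be checked directly and the linear feedback $u^*$ is optimal.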



C. Dual Representation of Controlled Markov Diffusions 

In this subsection we present the information relaxation-based dual formulation of controlled Markov diffusions.
As in the MDP case, we relax the constraint that the decision at every time instant should be made based on the current
information and impose a penalty to punish access to future information. We will establish the weak duality,
strong duality and complementary slackness results for controlled Markov diffusions, which parallel the results in
MDPs. The value function-based optimal penalty is also characterized to motivate the practical use of our dual
formulation, which will be demonstrated in Section III.

We consider the perfect information relaxation, i.e., we can foresee all the future randomness generated by the
Brownian motion so that the decision made at any time $t \in [0, T]$ is based on the information set $\mathcal{F} = \mathcal{F}_T$. To
expand the set of feasible controls, we use $\mathcal{U}(t)$ to denote the set of measurable $\mathcal{U}$-valued control strategies
at time $t$, i.e., $\mathbf{u} = (u_s)_{s \in [t,T]} \in \mathcal{U}(t)$ if $\mathbf{u}$ is $\mathcal{B}([t,T]) \times \mathcal{F}$-measurable and $u_s$ takes value in $\mathcal{U}$ for $s \in [t,T]$, where
$\mathcal{B}([t,T])$ is the Borel $\sigma$-algebra on $[t,T]$. In particular, $\mathcal{U}(0)$ can be viewed as the counterpart of $\mathbb{A}$ introduced in
Section II-A for MDPs.

Unlike the case of MDPs, the first technical problem we have to face is to define a solution of (6) with
an anticipative control $\mathbf{u} \in \mathcal{U}(0)$. Since it involves the concept of "anticipating stochastic calculus", we postpone
the relevant details to Appendix D-A, where we use the decomposition technique to define the solution of an
anticipating SDE following [20], [18]. For this purpose we need to restrict $\sigma(t, x, u)$ in (6) to depend only
on $t$ and $x$, i.e., we have $\sigma: [0, T] \times \mathbb{R}^n \to \mathbb{R}^{n \times m}$ in this subsection.

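For now we assume that, given a control strategy $\mathbf{u} \in \mathcal{U}(0)$, there exists a unique solution $(x_t)_{t \in [0,T]}$ to (6) that
is $\mathcal{B}([0,T]) \times \mathcal{F}$-measurable. Next we consider the set of penalty functions in the setting of controlled Markov
diffusions. Suppose $h(\mathbf{u}, \mathbf{w})$ is a function depending on a control strategy $\mathbf{u} \in \mathcal{U}(0)$ and a sample path of Brownian
motion $\mathbf{w} = (w_t)_{t \in [0,T]}$. We define the set $\mathcal{M}_{\mathbb{F}}(0)$ of dual feasible penalties $h(\mathbf{u}, \mathbf{w})$ that do not penalize
non-anticipative policies in expectation, i.e.,

$$\mathbb{E}_{0,x}[h(\mathbf{u}, \mathbf{w})] \le 0 \quad \text{for all } x \in \mathbb{R}^n \text{ and } \mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0).$$

In the following we will show that $\mathcal{M}_{\mathbb{F}}(0)$ plays the same role in the dual formulation of controlled
Markov diffusions as $\mathcal{M}_{\mathbb{G}}(0)$ does for MDPs.

With an arbitrary choice of $h \in \mathcal{M}_{\mathbb{F}}(0)$, we can determine an upper bound on (7) with $t = 0$ by relaxing the
constraint on the adaptiveness of control strategies.

Proposition 1 (Weak Duality) If $h \in \mathcal{M}_{\mathbb{F}}(0)$, then for all $x \in \mathbb{R}^n$,

$$\sup_{\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)} J(0, x; \mathbf{u}) \le \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)} \Big\{\Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt - h(\mathbf{u}, \mathbf{w})\Big\}\Big]. \tag{10}$$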

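Proof. For any $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$,

$$\begin{aligned}
J(0, x; \mathbf{u}) &= \mathbb{E}_{0,x}\Big[\Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt\Big] \\
&\le \mathbb{E}_{0,x}\Big[\Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt - h(\mathbf{u}, \mathbf{w})\Big] \\
&\le \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)} \Big\{\Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt - h(\mathbf{u}, \mathbf{w})\Big\}\Big].
\end{aligned}$$

The first inequality uses $\mathbb{E}_{0,x}[h(\mathbf{u}, \mathbf{w})] \le 0$ for non-anticipative $\mathbf{u}$, and the second uses $\mathcal{U}_{\mathbb{F}}(0) \subset \mathcal{U}(0)$.
Then inequality (10) can be obtained by taking the supremum over $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$ on the left hand side of the last
inequality. ■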

The optimization problem inside the conditional expectation in (10) is the counterpart of (4) in the context of
controlled Markov diffusions: an entire path of $\mathbf{w}$ is known beforehand (i.e., perfect information relaxation), and
the objective function depends on a specific trajectory of $\mathbf{w}$. Therefore, it is a deterministic and path-dependent
optimal control problem parameterized by $\mathbf{w}$. We also call it an inner optimization problem, and the expectation
term on the right hand side of (10) is a dual bound on the value function $V(0, x)$. [20], [22], [21] have conducted a
series of studies on this problem under the name "anticipative stochastic control"; in particular, one of the special
cases they considered is $h = 0$, which means the future information is accessed without any penalty. [20]
characterized the reward gained from the perfect information relaxation by a PDE. We would expect the dual bound
associated with zero penalty to be very loose, as is the case in MDPs. Suppose the inner optimization problem can be
solved by some technique. Then the evaluation of the dual bound is well suited to Monte Carlo simulation: we can
generate a sample path of $\mathbf{w}$ and solve the inner optimization problem in (10), the optimal value of which is a high-biased
estimator of $V(0, x)$.
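The looseness of the zero-penalty bound can be made explicit in a simple case. The following worked example is an assumption introduced here for illustration: a scalar problem with $dx_t = u_t\,dt + \sigma\,dw_t$, $\mathcal{U} = \mathbb{R}$, $g = -u^2$ and $\Lambda(x) = -x^2$, for which a quadratic ansatz in the HJB equation (8) gives $V(0,x) = -x^2/(1+T) - \sigma^2\ln(1+T)$. With $h = 0$ and the whole path of $\mathbf{w}$ known, only $w_T$ matters for the inner problem:

```latex
\begin{aligned}
&\text{Inner problem } (h = 0): &&
  \sup_{\mathbf{u}} \Big\{ -\int_0^T u_t^2\,dt - \Big(c + \int_0^T u_t\,dt\Big)^2 \Big\},
  \qquad c = x + \sigma w_T. \\
&\text{By Jensen's inequality:} &&
  \int_0^T u_t^2\,dt \ge \frac{a^2}{T}, \quad a = \int_0^T u_t\,dt,
  \text{ with equality for constant } u_t = a/T. \\
&\text{Optimizing over } a: &&
  a^* = -\frac{cT}{1+T}, \qquad \text{inner optimal value} = -\frac{c^2}{1+T}. \\
&\text{Dual bound:} &&
  \mathbb{E}_{0,x}\Big[-\frac{(x + \sigma w_T)^2}{1+T}\Big] = -\frac{x^2 + \sigma^2 T}{1+T}. \\
&\text{Gap to } V(0,x): &&
  -\frac{x^2 + \sigma^2 T}{1+T} - V(0,x) = \sigma^2\Big(\ln(1+T) - \frac{T}{1+T}\Big) > 0 .
\end{aligned}
```

In this example the gap grows with $\sigma^2$ and with the horizon $T$, which is consistent with the expectation that an unpenalized relaxation overstates the achievable reward.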

An interesting case is when we choose

$$h^*(\mathbf{u}, \mathbf{w}) = \Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt - V(0, x). \tag{11}$$

Note that $h^* \in \mathcal{M}_{\mathbb{F}}(0)$, since

$$\mathbb{E}_{0,x}\Big[\Lambda(x_T) + \int_0^T g(s, x_s, u_s)\,ds\Big] \le V(0, x) \quad \text{for all } x \in \mathbb{R}^n \text{ and } \mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0),$$

by the definition of $V(0, x)$. We also note that, after plugging $h = h^*$ into the inner optimization problem in (10), the
objective value is independent of $\mathbf{u}$ and is always equal to $V(0, x)$. So the following strong duality result
is obtained.

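Theorem 3 (Strong Duality) For all $x \in \mathbb{R}^n$,

$$\sup_{\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)} J(0, x; \mathbf{u}) = \inf_{h \in \mathcal{M}_{\mathbb{F}}(0)} \Big\{ \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)} \Big\{\Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt - h(\mathbf{u}, \mathbf{w})\Big\}\Big] \Big\}. \tag{12}$$

The infimum on the right hand side of (12) can always be achieved by choosing an $h \in \mathcal{M}_{\mathbb{F}}(0)$ in the form of (11).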

Due to the strong duality result, the left hand side problem of (12) is referred to as the primal problem and
the right hand side problem of (12) is referred to as the dual problem. Since the relaxation of the requirement on
admissible strategies is penalized and compensated by using a proper function in $\mathcal{M}_{\mathbb{F}}(0)$, we can see why $h \in \mathcal{M}_{\mathbb{F}}(0)$
is called a (dual feasible) penalty function. If $\mathbf{u}^*$ is a control strategy that achieves the supremum on the left side
of (12), and $h^*$ is a dual feasible penalty that achieves the infimum on the right side of (12), then they are referred
to as the optimal solutions to the primal and dual problem, respectively. The "complementary slackness condition"
in the next theorem characterizes such a pair $(\mathbf{u}^*, h^*)$, which parallels the discrete-time problem (Theorem 2.2 in
[6]).

Theorem 4 (Complementary Slackness) Given $\mathbf{u}^* \in \mathcal{U}_{\mathbb{F}}(0)$ and $h^* \in \mathcal{M}_{\mathbb{F}}(0)$, a sufficient and necessary condition
for $\mathbf{u}^*$ and $h^*$ being optimal to the primal and dual problem respectively is that

$$\mathbb{E}_{0,x}[h^*(\mathbf{u}^*, \mathbf{w})] = 0,$$

and

$$\mathbb{E}_{0,x}\Big[\Lambda(x^*_T) + \int_0^T g(t, x^*_t, u^*_t)\,dt - h^*(\mathbf{u}^*, \mathbf{w})\Big] = \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)} \Big\{\Lambda(x_T) + \int_0^T g(t, x_t, u_t)\,dt - h^*(\mathbf{u}, \mathbf{w})\Big\}\Big], \tag{13}$$

where $x^*_t$ is the solution of (6) using the control strategy $\mathbf{u}^* = (u^*_t)_{t \in [0,T]}$ on $[0, t)$ with the initial condition $x^*_0 = x$.

Here, we have the same interpretation of the complementary slackness condition as that in the dual formulation of
MDPs: if the penalty is optimal to the dual problem, the decision maker will be satisfied with an optimal
non-anticipative control strategy even if she is able to choose any anticipative control strategy. Clearly, if an optimal
control strategy $\mathbf{u}^*$ to the primal problem (6)-(7) does exist (see, e.g., Theorem 2(b)), then $\mathbf{u}^*$ and $h^*(\mathbf{u}, \mathbf{w})$ defined
in (11) are a pair of optimal solutions to the primal and dual problem. However, we note that the optimal penalty
in the form of (11) is intractable as it depends on the exact value of $V(0, x)$. The next proposition provides some
motivation to design good penalties.

Proposition 2 Suppose the value function $V(t, x)$ and the optimal control $\mathbf{u}^*$ satisfy all the assumptions in Theorem
2(b). Then $h^*(\mathbf{u}^*, \mathbf{w})$ has the following equivalent form:

$$h^*(\mathbf{u}^*, \mathbf{w}) = \int_0^T V_x(t, x^*_t)^\top \sigma(t, x^*_t)\,dw_t,$$

where $x^*_t$ is the solution of (6) using the optimal control $\mathbf{u}^* = (u^*_t)_{t \in [0,T]}$ on $[0, t)$ with the initial condition $x^*_0 = x$.
Proof. Since the value function $V(t, x) \in C^{1,2}(Q) \cap C_p(\bar{Q})$, we can apply the Ito differential rule to
$V(t, x^*_t)$ given $\mathbf{u}^* = (u^*_t)_{0 \le t \le T}$ (note that $V(T, x^*_T) = \Lambda(x^*_T)$):

$$\begin{aligned}
h^*(\mathbf{u}^*, \mathbf{w}) &= \Lambda(x^*_T) + \int_0^T g(t, x^*_t, u^*_t)\,dt - V(0, x) \\
&= V(0, x) + \int_0^T A^{u^*_t} V(t, x^*_t)\,dt + \int_0^T V_x(t, x^*_t)^\top \sigma(t, x^*_t)\,dw_t + \int_0^T g(t, x^*_t, u^*_t)\,dt - V(0, x) \\
&= \int_0^T \big(A^{u^*_t} V(t, x^*_t) + g(t, x^*_t, u^*_t)\big)\,dt + \int_0^T V_x(t, x^*_t)^\top \sigma(t, x^*_t)\,dw_t \\
&= \int_0^T V_x(t, x^*_t)^\top \sigma(t, x^*_t)\,dw_t,
\end{aligned}$$

where the last equality holds due to (9). ■
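The identity in Proposition 2 can be checked numerically path by path. The sketch below does so on an assumed scalar test problem that is not taken from the paper: $dx_t = u_t\,dt + \sigma\,dw_t$, $g = -u^2$, $\Lambda(x) = -x^2$, for which a quadratic ansatz in the HJB equation gives $V(t,x) = -x^2/(1+T-t) - \sigma^2\ln(1+T-t)$ and $u^*(t,x) = -x/(1+T-t)$. It simulates the optimally controlled state with an Euler-Maruyama scheme and compares $h^*(\mathbf{u}^*, \mathbf{w})$ from (11) with a left-endpoint approximation of the Ito integral; the function name `pathwise_check` is hypothetical.

```python
import numpy as np

def pathwise_check(x0=1.0, T=1.0, sigma=1.0, n_steps=2000, n_paths=2000, seed=1):
    """Compare h*(u*, w) from (11) with the Ito integral of V_x * sigma, per path."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps

    def p(t):
        return 1.0 / (1.0 + T - t)

    def V(t, x):
        # Value function of the assumed linear-quadratic test problem.
        return -p(t) * x**2 - sigma**2 * np.log(1.0 + T - t)

    x = np.full(n_paths, float(x0))
    running = np.zeros(n_paths)    # approximates the integral of g(t, x*, u*)
    ito = np.zeros(n_paths)        # approximates the integral of V_x sigma dw
    for k in range(n_steps):
        t = k * dt
        u = -p(t) * x                                  # optimal feedback control
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        running += -u**2 * dt
        ito += (-2.0 * p(t) * x) * sigma * dw          # V_x(t, x) = -2 p(t) x
        x = x + u * dt + sigma * dw
    h_star = -x**2 + running - V(0.0, x0)              # equation (11) at u = u*
    return h_star, ito

h_star, ito = pathwise_check()
print(np.mean(np.abs(h_star - ito)), np.corrcoef(h_star, ito)[0, 1])
```

The two quantities differ only through time discretization, so the mean absolute difference should shrink as `n_steps` grows, while each of them individually has a standard deviation of order one.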
The optimal penalty $h^*(\mathbf{u}, \mathbf{w})$ can be written compactly as an Ito stochastic integral when it is evaluated at
$\mathbf{u} = \mathbf{u}^*$. A natural question is whether $\int_0^T V_x(t, x_t)^\top \sigma(t, x_t)\,dw_t$ plays the role of an optimal penalty in (12),
as $M^*(\mathbf{a}, \mathbf{v})$ does in Theorem 1, achieving the strong duality. Unfortunately, $\int_0^T V_x(t, x_t)^\top \sigma(t, x_t)\,dw_t$ is not even a
well-defined object in terms of an Ito stochastic integral when $\mathbf{u}$ is not adapted to $\mathbb{F}$. To fix this problem, we again
need the machinery of "anticipating stochastic calculus". However, we can still provide a concise answer here:
there exists an alternative optimal penalty that coincides with $\int_0^T V_x(t, x_t)^\top \sigma(t, x_t)\,dw_t$ when $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$. We fully
develop the relevant results in Theorem 5 in Appendix D-B. In the following proposition we formalize one of
the main results in Theorem 5, which also guides the numerical approximation scheme that will be illustrated in
Section III.

Proposition 3 Suppose the value function $V(t, x)$ defined in (7) satisfies all the assumptions in Theorem 2(b).
Then under some technical conditions, there is an optimal solution to the dual problem, i.e., an optimal penalty
$h^*_v(\mathbf{u}, \mathbf{w}) \in \mathcal{M}_{\mathbb{F}}(0)$ in the form of

$$h^*_v(\mathbf{u}, \mathbf{w}) = \int_0^T V_x(t, x_t)^\top \sigma(t, x_t)\,dw_t \quad \text{for } \mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0), \tag{14}$$

where $x_t$ is the solution of (6) using the control $\mathbf{u} = (u_t)_{t \in [0,T]}$ on $[0, t)$ with the initial condition $x_0 = x$.

Since the value functions $\{V(t, x), 0 \le t \le T\}$ are unknown in real applications, how does Proposition 3 guide
us to generate a suboptimal penalty given approximate value functions $\{\hat{V}(t, x), 0 \le t \le T\}$ that are sufficiently
regular? The form of $h^*_v(\mathbf{u}, \mathbf{w})$ implies that it can be approximated by $h_{\hat{v}}(\mathbf{u}, \mathbf{w}) = \int_0^T \hat{V}_x(t, x_t)^\top \sigma(t, x_t)\,dw_t$, at
least for $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$. If we further assume $\hat{V}_x(t, x)^\top \sigma(t, x)$ satisfies the polynomial growth condition in $x$, then
$\mathbb{E}_{0,x}[h_{\hat{v}}(\mathbf{u}, \mathbf{w})] = 0$ for all $x \in \mathbb{R}^n$ and $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$. As a result, $h_{\hat{v}}(\mathbf{u}, \mathbf{w}) \in \mathcal{M}_{\mathbb{F}}(0)$, which means that $h_{\hat{v}}$ can be used
to derive an upper bound on the value function $V(0, x)$ through (10). Therefore, in terms of the approximation
scheme implied by the form of the optimal penalty, Proposition 3 presents a value function-based penalty that can
be viewed as the continuous-time analogue of $M^*(\mathbf{a}, \mathbf{v})$ in (5).
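The approximation scheme can be sketched end to end on a time-discretized problem. The example below is an illustration under stated assumptions, not the paper's portfolio model: the scalar problem $dx_t = u_t\,dt + \sigma\,dw_t$ with $g = -u^2$, $\Lambda(x) = -x^2$ and unconstrained controls, discretized on a uniform grid. The penalty is the left-endpoint sum $\sum_k V_x(t_k, x_k)\sigma\,\Delta w_k$ with $V_x(t,x) = -2x/(1+T-t)$, the gradient of the closed-form value function of the continuous-time problem, used here in place of an approximation $\hat{V}_x$. Because the inner problem is then a concave quadratic in the controls, it is solved in closed form for each path; the function name `lq_bounds` is hypothetical.

```python
import numpy as np

def lq_bounds(x0=1.0, T=1.0, sigma=1.0, n_steps=50, n_paths=20000, seed=2):
    """Return (lower bound, penalized upper bound, zero-penalty upper bound)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    t = dt * np.arange(n_steps)                  # grid t_0, ..., t_{K-1}
    p = 1.0 / (1.0 + T - t)                      # V_x(t, x) = -2 p(t) x
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))

    # Lower bound: simulate the non-anticipative feedback policy u = -p(t) x.
    x = np.full(n_paths, float(x0))
    reward = np.zeros(n_paths)
    for k in range(n_steps):
        u = -p[k] * x
        reward -= u**2 * dt
        x = x + u * dt + sigma * dw[:, k]
    lower = (reward - x**2).mean()

    # Uncontrolled part of the state: x_k = c_k + dt * (u_0 + ... + u_{k-1}).
    w = np.cumsum(dw, axis=1)
    c = x0 + sigma * np.hstack([np.zeros((n_paths, 1)), w])   # (n_paths, K+1)

    def inner(d):
        """Pathwise maximum over u of -dt*sum(u^2) - x_K^2 + sum_k d_k x_k."""
        # e_j = sum_{k>j} d_k multiplies dt*u_j once sum_k d_k x_k is expanded.
        e = np.cumsum(d[:, ::-1], axis=1)[:, ::-1] - d
        x_K = (c[:, -1] + 0.5 * dt * e.sum(axis=1)) / (1.0 + T)   # optimal terminal state
        return (0.25 * dt * (e**2).sum(axis=1) - (1.0 + T) * x_K**2
                + (d * c[:, :-1]).sum(axis=1))

    upper_zero = inner(np.zeros_like(dw)).mean()       # h = 0
    # Minus the penalty equals sum_k d_k x_k with d_k = 2 sigma p_k dw_k.
    upper_pen = inner(2.0 * sigma * p * dw).mean()
    return lower, upper_pen, upper_zero

lower, upper_pen, upper_zero = lq_bounds()
print(lower, upper_pen, upper_zero)
```

In discrete time the left-endpoint penalty has zero mean under any adapted control, so it is dual feasible for the discretized problem regardless of how the inner problem is solved. With these settings the penalized upper bound should sit close to the simulated lower bound, while the zero-penalty bound is visibly looser.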

The complementary slackness condition in both the discrete-time (Theorem 2.2 in [6]) and continuous-time
(Theorem 4) cases reveals that any optimal penalty has zero expectation when evaluated at an optimal policy; as a stronger
version, the value function-based optimal penalty in both cases assigns zero expectation to all non-anticipative policies
(note that $M^*$ in (5) is a sum of martingale differences under the original filtration $\mathbb{G}$).

Intuitively, we can interpret the strong duality achieved by the value function-based penalty as offsetting the
path-dependent randomness in the inner optimization problem; then the optimal control to the inner optimization
problem coincides with that to the original stochastic control problem in the expectation sense, which is reflected
by the proof of Theorem 5 in Appendix D-B for controlled Markov diffusions (resp., see the proof of Theorem
1(b) in [11] for MDPs). This idea should also apply to other continuous-time controlled Markov processes. For
example, we can directly formulate the dual representation of controlled jump diffusions in a parallel way, and a
result similar to Proposition 2 can be derived after applying the Ito formula with jumps.

In addition to the value function-based penalty, [20] (see its Theorem 2.1 and Theorem 2.2) proposed a Lagrangian
approach that falls into our dual framework of controlled Markov diffusions, where the Lagrangian term behaves 
like a "penalty" function in the sense that it satisfies the complementary slackness condition developed in Theorem 
4. We find that the derivation of this Lagrangian term is analogous to that of the gradient-based penalty proposed 
by [9] for MDPs; therefore, we review these results in Appendix D-C for reference. 

We note that most of the numerical methods proposed so far for solving (7) focus on the operator of the HJB 
equation (see, for example, [16] and [17]). We will show the practical use of the dual formulation of controlled 
Markov diffusions, especially the value function-based penalty in the form of (14), in solving a dynamic portfolio 
choice problem in the next section. 

Finally, we should point out that although the dual formulation of controlled Markov diffusions established in this
subsection is stated for a diffusion coefficient $\sigma$ that is a function of $t$ and $x$ only, its validity relies
only on the existence of a unique pathwise solution $(x_t)_{t\in[0,T]}$ to (6) with $u \in \mathcal{U}(0)$. In other
words, the dual formulation remains valid if such a solution can be properly defined for a general $\sigma$ that also
depends on $u$.

III. Dynamic Portfolio Choice Problem 

In this section we show how the value function-based optimal penalty helps to solve a classic dynamic
portfolio choice problem with predictable returns and intermediate consumption. Since the early work of [27], [28],
[29], dynamic portfolio choice problems have become computationally intensive as more model features, such as
position constraints, transaction costs and risk measures, have been incorporated. Some recent works along this line
include [30], [31], [32], [14]. Since most portfolio choice problems of practical interest cannot be solved
analytically, various numerical methods have been developed to address them. These approximation schemes include,
but are not limited to, the martingale approach [33], [34], state-space discretization methods [35], [36], and
approximate dynamic programming methods [37], [17]. These methods all induce sub-optimal policies, and simulating
such a policy yields a lower bound on the optimal expected utility in a straightforward way. However, it is often
hard to tell how far the induced policy is from the optimal one. Though some methods are asymptotically convergent,
their accuracy under limited computational power cannot be measured. To overcome this problem, [38] and [9]
constructed upper bounds on the expected utility based on the dual formulation of the constrained portfolio choice
problem proposed by [39] and the information relaxation duality method proposed by [6], [5], respectively. The gap
between the lower bound and the upper bound can be used to assess the performance of a candidate policy.

We focus on solving a discrete-time dynamic portfolio choice problem that is discretized from a continuous-time 
model, which is similar to the one considered in [40] and [39]. We evaluate the lower bound on the optimal expected 
utility for this discrete-time model by performing sub-optimal policies using Monte Carlo simulation. To obtain 
an upper bound on the optimal expected utility simultaneously, we apply the information relaxation dual approach 
and propose a new class of penalties based on the time discretization of the value function-based optimal penalties 
of the continuous-time model; these penalties make the inner optimization problem much easier to solve in terms 
of computation compared with the penalties directly derived from the discrete-time model. We demonstrate the 
effectiveness of our method in computing dual bounds through numerical experiments. 

A. The Portfolio Choice Model 

We first consider a continuous-time financial market with finite horizon $[0,T]$, which is built on the probability
space $(\Omega, \mathcal{F}, \mathbb{P})$. There are one risk-free asset (cash) and $n$ risky assets that the investor
can invest in. The risk-free asset is denoted by $S_t^0$ and the instantaneous risk-free rate of return is denoted by
$r_f$. Then $S_t^0$ follows the process

$$\frac{dS_t^0}{S_t^0} = r_f\,dt.$$

The vector of risky assets is denoted by $S_t = (S_t^1,\dots,S_t^n)^\top$ and it follows a geometric Brownian motion

$$\frac{dS_t}{S_t} = \mu_t\,dt + \sigma_t\,dz_t, \tag{15}$$

where $dS_t/S_t$ denotes $(dS_t^1/S_t^1,\dots,dS_t^n/S_t^n)^\top$ and $z = (z_t)_{0\le t\le T}$ is an $n$-dimensional
standard Brownian motion. The drift $\mu_t = \mu(t,\phi_t)$ and the volatility $\sigma_t = \sigma(t,\phi_t)$ are of
proper dimensions and are functions of an $m$-dimensional market state variable $\phi_t$ that follows another
diffusion process

$$d\phi_t = \mu_t^{\phi}\,dt + \sigma_t^{\phi,1}\,dz_t + \sigma_t^{\phi,2}\,d\tilde z_t, \tag{16}$$

where $\mu_t^{\phi} = \mu^{\phi}(t,\phi_t)$, $\sigma_t^{\phi,1} = \sigma^{\phi,1}(t,\phi_t)$,
$\sigma_t^{\phi,2} = \sigma^{\phi,2}(t,\phi_t)$, and $\tilde z = (\tilde z_t)_{0\le t\le T}$ is another
$d$-dimensional standard Brownian motion independent of $z$. Denote the filtration by
$\mathbb{F} = \{\mathcal{F}_t, 0\le t\le T\}$, where $\mathcal{F}_t$ is generated by
$\{(z_s,\tilde z_s), 0\le s\le t\}$. The covariance matrices $\sigma_t\sigma_t^\top$,
$\sigma_t^{\phi,1}(\sigma_t^{\phi,1})^\top$, and $\sigma_t^{\phi,2}(\sigma_t^{\phi,2})^\top$ are denoted by
$\Sigma_t$, $\Sigma_t^{\phi,1}$, and $\Sigma_t^{\phi,2}$, respectively.

Let $\pi_t = (\pi_t^1,\dots,\pi_t^n)^\top$ denote the fractions of wealth invested in the risky assets. The
instantaneous rate of consumption is denoted by $c_t$. The total wealth $W_t$ of a portfolio that consists of $n$
risky assets and one risk-free asset evolves according to

$$\begin{aligned}
dW_t &= W_t\big[\pi_t^\top(\mu_t\,dt + \sigma_t\,dz_t) + r_f(1 - \pi_t^\top \mathbf{1}_n)\,dt - c_t\,dt\big] \\
     &= W_t\big(\pi_t^\top(\mu_t - r_f\mathbf{1}_n) + r_f - c_t\big)\,dt + W_t\,\pi_t^\top\sigma_t\,dz_t,
\end{aligned} \tag{17}$$

where $\mathbf{1}_n$ is the $n$-dimensional all-ones vector. The control process $u = (u_t)_{0\le t\le T}$ with
$u_t = (\pi_t,c_t)$ is assumed to be an admissible strategy in the sense that

1) The control $u$ is $\mathbb{F}$-progressively measurable and $\mathbb{E}[\int_0^T \|u_t\|^2\,dt] < \infty$;

2) We require $W_t \ge 0$, $c_t \ge 0$, and $\int_0^T W_t c_t\,dt < \infty$ a.s.;

3) We may restrict $u_t \in \mathcal{U}$, where $\mathcal{U}$ is a closed convex set in $\mathbb{R}^{n+1}$.

We still use $\mathcal{U}_{\mathbb{F}}(t)$ to denote the set of admissible strategies at time $t$ and we will specify
the control space $\mathcal{U}$ later. Suppose $U_1$ and $U_2$ are two strictly increasing and concave utility
functions. The investor's objective is to maximize the weighted sum of the expected utility of the intermediate
consumption and the final wealth:

$$V(t,\phi_t,W_t) = \sup_{u\in\mathcal{U}_{\mathbb{F}}(t)} \mathbb{E}\Big[\int_t^T \alpha\beta^{s}U_1(c_sW_s)\,ds
+ (1-\alpha)\beta^{T}U_2(W_T)\,\Big|\,\phi_t,W_t\Big], \tag{18}$$

where $\beta$ is the discount factor, and $\alpha$ indicates the relative importance of the intermediate consumption.
Since the utility gain caused by the intermediate consumption generally means a potential loss of the utility from
the final wealth, the investor seeks to balance her dynamic portfolio strategy at every time instant.

The value function (18) sometimes admits an analytic solution, for example, under the assumption that $\mu_t$ and
$\sigma_t$ are constants and there is no constraint on $u_t = (\pi_t,c_t)$. Recent progress on the analytic
tractability of (18) can be found in [40]. However, (18) usually does not have an analytic solution when there is a
position constraint on $\pi_t$.

Considering that investment and consumption can only take place at a finite number of times in the real
world, we solve the discrete-time counterpart of the continuous-time problem (16)-(18) by discretizing its time
space. Suppose the decisions take place at equally spaced times $\{0 = t_0, t_1, \dots, t_K = T\}$ such that
$K = T/\delta$, where $\delta = t_{k+1} - t_k$ for $k = 0,1,\dots,K-1$. We simply denote the time grid by
$\{0,1,\dots,K\}$. Note that (15) is equivalent to

$$d\log(S_t) = \big(\mu_t - \tfrac{1}{2}\sigma_t^2\big)\,dt + \sigma_t\,dz_t,$$

where $\sigma_t^2$ denotes the vector that consists of the diagonal of $\Sigma_t$. That is to say,
$S_{k+1} = R_{k+1}S_k$ (componentwise) with
$\log(R_{k+1}) \sim N\big((\mu_k - \tfrac{1}{2}\sigma_k^2)\delta,\ \Sigma_k\delta\big)$, or more precisely,
$\log(R_{k+1}) \sim N\big(\int_{k\delta}^{(k+1)\delta}(\mu_s - \tfrac{1}{2}\sigma_s^2)\,ds,\
\int_{k\delta}^{(k+1)\delta}\Sigma_s\,ds\big)$. Hence, we can discretize (16), (15), and (17) as follows:

$$\phi_{k+1} = \phi_k + \mu_k^{\phi}\delta + \sigma_k^{\phi,1}\sqrt{\delta}\,Z_{k+1}
+ \sigma_k^{\phi,2}\sqrt{\delta}\,\tilde Z_{k+1}, \tag{19a}$$

$$\log(R_{k+1}) = \big(\mu_k - \tfrac{1}{2}\sigma_k^2\big)\delta + \sigma_k\sqrt{\delta}\,Z_{k+1}, \tag{19b}$$

$$\begin{aligned}
W_{k+1} &= W_k(R_{k+1}^\top\pi_k) + W_k(1 - \mathbf{1}_n^\top\pi_k)R_f - W_kc_k \\
        &= W_k\big(R_f + (R_{k+1} - R_f\mathbf{1}_n)^\top\pi_k - c_k\big),
\end{aligned} \tag{19c}$$

where $\{(Z_k,\tilde Z_k), k = 1,\dots,K\}$ is a sequence of independent and identically distributed standard
Gaussian random vectors. In particular, we use $R_f = 1 + r_f\delta$ and the decision variable $c_k$ to approximate
$e^{r_f\delta}$ and $c_t\delta$ (at $t = k\delta$), respectively, due to the discretization procedure.
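The recursion (19a)-(19c) can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used in the experiments: it assumes a scalar market state ($m = 1$), a scalar shock $\tilde Z$, and toy coefficient functions (`mu`, `sigma`, `mu_phi`, `sig_phi1`, `sig_phi2`) and a toy `policy` that are hypothetical placeholders rather than the parameter sets of Appendix E.

```python
import numpy as np

def simulate_path(phi0, W0, policy, K, delta, r_f, mu, sigma, mu_phi, sig_phi1, sig_phi2, rng):
    """Simulate one path of (19a)-(19c) with a scalar market state (m = 1).

    mu(phi) -> (n,), sigma(phi) -> (n, n), mu_phi(phi) -> float,
    sig_phi1(phi) -> (n,), sig_phi2(phi) -> float, policy(k, phi) -> (pi, c).
    """
    n = len(mu(phi0))
    R_f = 1.0 + r_f * delta                    # approximates exp(r_f * delta)
    phi, W = np.empty(K + 1), np.empty(K + 1)
    phi[0], W[0] = phi0, W0
    for k in range(K):
        Z = rng.standard_normal(n)             # shocks driving the returns
        Z_tilde = rng.standard_normal()        # independent shock to the market state
        sig = sigma(phi[k])
        sig2 = np.diag(sig @ sig.T)            # diagonal of Sigma_k
        # (19b): gross returns R_{k+1}
        R = np.exp((mu(phi[k]) - 0.5 * sig2) * delta + np.sqrt(delta) * sig @ Z)
        pi, c = policy(k, phi[k])
        # (19c): wealth update under the fractions (pi, c)
        W[k + 1] = W[k] * (R_f + (R - R_f) @ pi - c)
        # (19a): market state update
        phi[k + 1] = (phi[k] + mu_phi(phi[k]) * delta
                      + np.sqrt(delta) * (sig_phi1(phi[k]) @ Z + sig_phi2(phi[k]) * Z_tilde))
    return phi, W

# Toy coefficients, chosen only for illustration (hypothetical values).
n = 3
mu = lambda phi: np.full(n, 0.08) + 0.02 * phi
sigma = lambda phi: 0.2 * np.eye(n)
mu_phi = lambda phi: -0.5 * phi                # mean reversion towards zero
sig_phi1 = lambda phi: np.full(n, -0.05)
sig_phi2 = lambda phi: 0.1
policy = lambda k, phi: (np.full(n, 0.2), 0.05)  # constant fractions of wealth

phi_path, W_path = simulate_path(0.0, 1.0, policy, K=10, delta=0.1, r_f=0.02,
                                 mu=mu, sigma=sigma, mu_phi=mu_phi,
                                 sig_phi1=sig_phi1, sig_phi2=sig_phi2,
                                 rng=np.random.default_rng(0))
```

A policy that holds only cash and consumes nothing makes the wealth grow deterministically by the factor $R_f$ per period, which gives a quick sanity check of the wealth update.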

Here we abuse the notations $\phi$, $W$, and $\pi$ in the continuous-time and discrete-time settings. However, the
subscripts make them easy to distinguish: the subscript $t \in [0,T]$ is used in the continuous-time model, while
$k = 0,\dots,K$ is used in the discrete-time model.



Denote the filtration of the process (19) by $\mathbb{G} = \{\mathcal{G}_0,\dots,\mathcal{G}_K\}$, where
$\mathcal{G}_k$ is generated by $\{(Z_j,\tilde Z_j), j = 0,\dots,k\}$. In our numerical examples we assume that
short sales and borrowing are not allowed, and the consumption cannot exceed the amount of the risk-free asset. Then
the constraint on the control $a_k = (\pi_k,c_k)$ for the discrete-time problem can be defined as

$$\mathcal{A} = \{(\pi,c)\in\mathbb{R}^{n+1} \mid \pi \ge 0,\ c \ge 0,\ c \le R_f(1 - \mathbf{1}_n^\top\pi)\}. \tag{20}$$

Since $c_k$ is used to approximate $c_t\delta$, (20) corresponds to a control set for the continuous-time model,
which is defined as

$$\mathcal{U} \triangleq \{(\pi,c)\in\mathbb{R}^{n+1} \mid \pi \ge 0,\ c \ge 0,\
c \le R_f(1 - \mathbf{1}_n^\top\pi)/\delta\}.$$

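Let $\mathbb{A}_{\mathbb{G}}$ again denote the set of $\mathcal{A}$-valued control strategies
$a = (a_0,\dots,a_{K-1})$ that are adapted to the filtration $\mathbb{G}$. The discretization of (18) serves as the
value function of the discrete-time problem:

$$H_0(\phi_0,W_0) = \sup_{a\in\mathbb{A}_{\mathbb{G}}} \mathbb{E}\Big[\sum_{k=0}^{K-1}\alpha\beta^{k\delta}
U_1(c_kW_k)\,\delta + (1-\alpha)\beta^{K\delta}U_2(W_K)\Big], \tag{21}$$

which can be solved via dynamic programming:

$$\begin{aligned}
H_K(\phi_K,W_K) &= (1-\alpha)\beta^{K\delta}U_2(W_K), \\
H_k(\phi_k,W_k) &= \sup_{a_k\in\mathcal{A}}\big\{\alpha\beta^{k\delta}U_1(c_kW_k)\,\delta
+ \mathbb{E}[H_{k+1}(\phi_{k+1},W_{k+1}) \mid \phi_k,W_k]\big\}.
\end{aligned} \tag{22}$$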

We will focus on solving the discrete-time model (19)-(21), which is discretized from the continuous-time model
(16)-(18). Though the methods proposed later apply to general utility functions, for the purpose of illustration
we consider utility functions of the constant relative risk aversion (CRRA) type with coefficient $\gamma > 0$, i.e.,
$U(x) = U_1(x) = U_2(x) = \frac{1}{1-\gamma}x^{1-\gamma}$, which are widely used in economics and finance. With
utility functions of CRRA type, both value functions (18) and (21) have simplified structures. To be specific, the
value function of the continuous-time problem can be written as

$$V(t,\phi_t,W_t) = \beta^{t}W_t^{1-\gamma}J(t,\phi_t), \tag{23}$$

where $J(T,\phi_T) = (1-\alpha)/(1-\gamma)$, and

$$J(t,\phi) = \sup_{u\in\mathcal{U}_{\mathbb{F}}(t)} \mathbb{E}\Big[\int_t^T \beta^{s-t}\frac{\alpha}{1-\gamma}
(c_sW_s)^{1-\gamma}\,ds + \beta^{T-t}\frac{1-\alpha}{1-\gamma}W_T^{1-\gamma}\,\Big|\,\phi_t = \phi, W_t = 1\Big];$$

and the value function of the discrete-time problem is

$$H_k(\phi_k,W_k) = \beta^{k\delta}W_k^{1-\gamma}J_k(\phi_k), \tag{24}$$

where $J_k$ is defined recursively as $J_K(\phi_K) = (1-\alpha)/(1-\gamma)$ and

$$J_k(\phi_k) = \sup_{(\pi_k,c_k)\in\mathcal{A}}\Big\{\frac{\alpha}{1-\gamma}c_k^{1-\gamma}\delta
+ \beta^{\delta}\,\mathbb{E}\big[\big(R_f + (R_{k+1} - R_f\mathbf{1}_n)^\top\pi_k - c_k\big)^{1-\gamma}
J_{k+1}(\phi_{k+1}) \mid \phi_k\big]\Big\}. \tag{25}$$
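The step from (22) to (24)-(25) is a short induction, sketched here for $W_k > 0$. Suppose the factorization (24) holds at time $k+1$. Substituting it into (22), with CRRA utility and the wealth update (19c), gives:

```latex
\begin{aligned}
H_k(\phi_k,W_k)
 &= \sup_{(\pi_k,c_k)\in\mathcal{A}}\Big\{\alpha\beta^{k\delta}\frac{(c_kW_k)^{1-\gamma}}{1-\gamma}\,\delta
    + \mathbb{E}\big[\beta^{(k+1)\delta}W_{k+1}^{1-\gamma}J_{k+1}(\phi_{k+1}) \mid \phi_k,W_k\big]\Big\} \\
 &= \beta^{k\delta}W_k^{1-\gamma}\sup_{(\pi_k,c_k)\in\mathcal{A}}\Big\{\frac{\alpha}{1-\gamma}c_k^{1-\gamma}\delta
    + \beta^{\delta}\,\mathbb{E}\big[\big(R_f + (R_{k+1}-R_f\mathbf{1}_n)^\top\pi_k - c_k\big)^{1-\gamma}
      J_{k+1}(\phi_{k+1}) \mid \phi_k\big]\Big\} \\
 &= \beta^{k\delta}W_k^{1-\gamma}J_k(\phi_k).
\end{aligned}
```

The positive factor $\beta^{k\delta}W_k^{1-\gamma}$ can be pulled out of the supremum because the constraint set $\mathcal{A}$ does not depend on $W_k$; this is also why the maximizer depends on $\phi_k$ only.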

It can be seen that the value functions of the continuous-time model and the discrete-time model have similar
structures: each can be decomposed as a product of a function of the wealth $W$ and a function of the market state
variable $\phi$. If $\delta$ is small, $J(k\delta,\phi)$ and $J_k(\phi)$ may be close to each other. As a byproduct
of this decomposition, another feature of the dynamic portfolio choice problem with CRRA utility function is that
the optimal asset allocation and consumption $(\pi_t,c_t)$ in the continuous-time model are independent of the
wealth $W_t$ given $\phi_t$ (respectively, the optimal $(\pi_k,c_k)$ in the discrete-time model is independent of the
wealth $W_k$ given $\phi_k$). So the dimension of the state space in (22) is actually the dimension of $\phi_k$. A
number of numerical methods have been developed to solve the discrete-time model based on the recursion (25),
including the state-space discretization approach [35], [36], and a simulation-based method [37].

B. Penalties and Dual Bounds 

Since we can use Monte Carlo simulation to evaluate the expected utility under any admissible strategy, we are
interested in how good the strategy is and how much better we could do. The purpose of this subsection is to provide
a way to evaluate the quality of the strategies developed for the discrete-time (continuous-state) model (19)-(21).
Note that this problem falls in the framework of the dual approach for MDPs introduced in Theorem 1, which
can in principle be used to complement the lower bound on the value function $H_0$ by an upper bound. However,
Theorem 1 does not directly suggest a tractable approximation scheme for penalty functions in continuous-state
problems. According to the numerical experiments reported in [6], [9], the choice of penalty significantly
influences the quality of the dual bounds. Therefore, we aim to accomplish two goals in this subsection:

• We want to design appropriate penalties that can help to achieve tight dual bounds.

• We want to keep the computational cost of the inner optimization problem at a reasonable level.
Throughout this subsection we assume that an approximation of $J_k(\phi)$, say $\hat J_k(\phi)$, and an approximate
policy $\hat a \in \mathbb{A}_{\mathbb{G}}$ are available. We do not require that $\hat a$ be derived from
$\hat J_k(\phi)$ or vice versa; in other words, they can be obtained using different approaches. We first describe
the information relaxation dual approach of MDPs in the context of our portfolio choice problem. We focus on the
perfect information relaxation that assumes the investor can foresee the future uncertainty $Z = (Z_1,\dots,Z_K)$
and $\tilde Z = (\tilde Z_1,\dots,\tilde Z_K)$, i.e., all the market states and returns of the risky assets. A
function $M(a,Z,\tilde Z)$ is a dual feasible penalty in the setting of the dynamic portfolio choice problem if for
any $(\phi_0,W_0)$,

$$\mathbb{E}[M(a,Z,\tilde Z) \mid \phi_0,W_0] \le 0 \quad \text{for all } a \in \mathbb{A}_{\mathbb{G}}. \tag{26}$$

Let $\mathcal{M}_{\mathbb{G}}(0)$ denote the set of all dual feasible penalties. For
$M \in \mathcal{M}_{\mathbb{G}}(0)$ we define $\mathcal{L}M$ as a function of $(\phi_0,W_0)$:

$$(\mathcal{L}M)(\phi_0,W_0) = \mathbb{E}\Big[\sup_{a\in\mathbb{A}}\Big\{\sum_{k=0}^{K-1}\alpha\beta^{k\delta}
U(c_kW_k)\,\delta + (1-\alpha)\beta^{K\delta}U(W_K) - M(a,Z,\tilde Z)\Big\}\,\Big|\,\phi_0,W_0\Big]. \tag{27}$$

Based on Theorem 1(a), $(\mathcal{L}M)(\phi_0,W_0)$ is an upper bound on $H_0(\phi_0,W_0)$ for any
$M \in \mathcal{M}_{\mathbb{G}}(0)$.

To ease the inner optimization problem, we introduce equivalent decision variables $\Pi_k = W_k\pi_k$ and
$C_k = W_kc_k$, which can be used interchangeably with $\pi_k$ and $c_k$. We still use $a$ to denote an admissible
strategy, though in terms of $(\Pi_k,C_k)$ now. Then we can rewrite the inner optimization problem in the conditional
expectation in (27) as follows:

$$I(\phi_0,W_0,M,Z,\tilde Z) = \max_{a}\Big\{\sum_{k=0}^{K-1}\alpha\beta^{k\delta}U(C_k)\,\delta
+ (1-\alpha)\beta^{K\delta}U(W_K) - M(a,Z,\tilde Z)\Big\} \tag{28a}$$

subject to

$$\begin{aligned}
\phi_{k+1} &= \phi_k + \mu_k^{\phi}\delta + \sigma_k^{\phi,1}\sqrt{\delta}\,Z_{k+1}
              + \sigma_k^{\phi,2}\sqrt{\delta}\,\tilde Z_{k+1}, \\
\log(R_{k+1}) &= \big(\mu_k - \tfrac{1}{2}\sigma_k^2\big)\delta + \sigma_k\sqrt{\delta}\,Z_{k+1}, \\
W_{k+1} &= W_kR_f + (R_{k+1} - R_f\mathbf{1}_n)^\top\Pi_k - C_k,
\end{aligned} \tag{28b}$$

$$\Pi_k \ge 0, \quad C_k \ge 0, \tag{28c}$$

$$C_k \le R_f(W_k - \mathbf{1}_n^\top\Pi_k), \quad \text{for } k = 0,\dots,K-1. \tag{28d}$$

Note that (28b) is equivalent to (19c), and (28c)-(28d) are equivalent to (20). The advantage of this reformulation
is that the inner optimization problem (28) has linear constraints. Therefore, we may find the global maximizer of
(28) as long as the objective function in (28a) is jointly concave in $a$.
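The equivalence between the wealth recursion in fractions, (19c), and in amounts, (28b), can be checked numerically. The sketch below uses hypothetical helper names and arbitrary toy returns; row `k` of `R` holds the gross returns $R_{k+1}$ realized over period $k$.

```python
import numpy as np

def wealth_fraction_form(W0, R, pi, c, R_f):
    """Wealth path from (19c); controls are fractions of wealth (pi_k, c_k)."""
    W = [W0]
    for k in range(len(c)):
        W.append(W[-1] * (R_f + (R[k] - R_f) @ pi[k] - c[k]))
    return np.array(W)

def wealth_amount_form(W0, R, Pi, C, R_f):
    """Wealth path from (28b); controls are amounts (Pi_k, C_k), so the update is affine."""
    W = [W0]
    for k in range(len(C)):
        W.append(W[-1] * R_f + (R[k] - R_f) @ Pi[k] - C[k])
    return np.array(W)

rng = np.random.default_rng(0)
K, n, R_f = 5, 3, 1.002
R = np.exp(0.05 * rng.standard_normal((K, n)))   # toy gross returns
pi = np.full((K, n), 0.2)                         # constant fractions, feasible for (20)
c = np.full(K, 0.05)

W = wealth_fraction_form(1.0, R, pi, c, R_f)
Pi = W[:-1, None] * pi                            # Pi_k = W_k * pi_k
C = W[:-1] * c                                    # C_k  = W_k * c_k
W_amounts = wealth_amount_form(1.0, R, Pi, C, R_f)
```

For a fixed realization of the returns, `wealth_amount_form` is affine in `(Pi, C)`, whereas `wealth_fraction_form` multiplies controls across periods; this is the structural reason the reformulated constraints are linear.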

Heuristically, we need to design near-optimal penalty functions in order to obtain tight dual bounds on $H_0$. A
natural approach is to construct a penalty function by replacing the value function in the optimal penalty for the
discrete-time problem (see (5)) with its approximation. In the following we will see why deriving the penalty in
this way results in an intractable inner optimization problem. Alternatively, the inner optimization problem becomes
tractable when we construct a heuristic penalty according to the form of the optimal penalty for the continuous-time
model. We first investigate the optimal penalty for the discrete-time problem according to (5):

$$M^*(a,Z,\tilde Z) = \sum_{k=0}^{K-1}\Delta H_{k+1}(a,Z,\tilde Z).$$

Based on the factorized structure of $H_k$ in (24), we can obtain

$$\begin{aligned}
\Delta H_{k+1}(a,Z,\tilde Z) &= H_{k+1}(\phi_{k+1},W_{k+1}) - \mathbb{E}[H_{k+1}(\phi_{k+1},W_{k+1}) \mid \phi_k,W_k,a_k] \\
 &= \beta^{(k+1)\delta}\big(\Lambda_{k+1}(a,Z,\tilde Z) - \mathbb{E}[\Lambda_{k+1}(a,Z,\tilde Z) \mid \phi_k,W_k,a_k]\big),
\end{aligned} \tag{29}$$

where

$$\Lambda_{k+1}(a,Z,\tilde Z) \triangleq \big(W_kR_f + (R_{k+1} - R_f\mathbf{1}_n)^\top\Pi_k - C_k\big)^{1-\gamma}
J_{k+1}(\phi_{k+1}).$$

We can explicitly write out $\phi_{k+1} = \phi_{k+1}(\phi_0,Z_{1:k+1},\tilde Z_{1:k+1})$ and
$R_{k+1} = R_{k+1}(\phi_0,Z_{1:k+1},\tilde Z_{1:k+1})$ via (19a)-(19b) to emphasize the dependence on the random
sequence $(Z,\tilde Z)$, where $Z_{1:k}$ (resp., $\tilde Z_{1:k}$) denotes the first $k$ entries of $Z$ (resp.,
$\tilde Z$). In practice we can approximate $J_k(\phi_k)$ by $\hat J_k(\phi_k)$; however, this does not mean that an
approximation of $\Delta H_{k+1}$ can be easily computed, since an intractable conditional expectation over an
$(m+n)$-dimensional space is involved in (29). Another difficulty is that
$M^* = \sum_{k=0}^{K-1}\Delta H_{k+1}$ enters into (28a) with possibly positive or negative signs for different
realizations of $(Z,\tilde Z)$, making the objective function of (28) nonconcave, even if $U_1$ and $U_2$ are
concave functions. Therefore, it might be extremely hard to locate the global maximizer of (28).

To address these problems, we exploit the value function-based optimal penalty $h_v^*$ for the continuous-time
problem (16)-(18), assuming that all the technical conditions are satisfied. If we disregard the dependence of the
diffusion coefficient of $W_t$ (the second term on the right side of (17)) on $\pi_t$, then according to
Proposition 3 we can formally write $h_v^*$ as

$$\begin{aligned}
h_v^* &= \sum_{k=0}^{K-1}\int_{k\delta}^{(k+1)\delta}\Big[V_\phi(t,\phi_t,W_t)^\top\sigma_t^{\phi,1}\,dz_t
        + V_\phi(t,\phi_t,W_t)^\top\sigma_t^{\phi,2}\,d\tilde z_t
        + V_W(t,\phi_t,W_t)\,W_t\,\pi_t^\top\sigma_t\,dz_t\Big] \\
      &= \sum_{k=0}^{K-1}\int_{k\delta}^{(k+1)\delta}\beta^{t}\Big[W_t^{1-\gamma}\nabla_\phi J(t,\phi_t)^\top
        \sigma_t^{\phi,1}\,dz_t + W_t^{1-\gamma}\nabla_\phi J(t,\phi_t)^\top\sigma_t^{\phi,2}\,d\tilde z_t
        + (1-\gamma)W_t^{1-\gamma}J(t,\phi_t)\,\pi_t^\top\sigma_t\,dz_t\Big],
\end{aligned} \tag{30}$$

for $u = (\pi_t,c_t)_{0\le t\le T} \in \mathcal{U}_{\mathbb{F}}(0)$, where the last equality holds due to the
structure of the value function (23). In particular, we use $\nabla_\phi J$ to denote the gradient of the function
$J$ with respect to $\phi$. Motivated by the fact that our discrete-time model is discretized from the
continuous-time model, we propose the following function that approximates each $\Delta H_{k+1}$ in $M^*$, that is,

$$\beta^{k\delta}\Big[W_k^{1-\gamma}\nabla_\phi J_k(\phi_k)^\top\sigma_k^{\phi,1}\sqrt{\delta}\,Z_{k+1}
+ W_k^{1-\gamma}\nabla_\phi J_k(\phi_k)^\top\sigma_k^{\phi,2}\sqrt{\delta}\,\tilde Z_{k+1}
+ (1-\gamma)W_k^{-\gamma}J_k(\phi_k)\,\Pi_k^\top\sigma_k\sqrt{\delta}\,Z_{k+1}\Big]. \tag{31}$$

This term is derived by discretizing the Itô stochastic integrals in the $(k+1)$-th term of the summation in (30),
and we will justify later that it is a dual feasible penalty for the discrete-time problem. For the purpose of
incorporating this term into the dual approach in the numerical implementation, we should note that the value of
$J_k(\phi_k)$ and the differentiability of $J_k(\phi)$ in $\phi$ are not known yet. However, as mentioned in the
beginning of Section III, we can approximate $J_k(\phi_k)$ using different approaches that lead to piecewise linear
functions (by the state-space discretization method) or smooth functions (by the approximate dynamic programming
method), denoted by $\hat J_k(\phi)$. Hence, $\nabla_\phi J_k(\phi_k)$ can be formally approximated by the gradient
of these approximate functions, namely, $\nabla_\phi\hat J_k(\phi_k)$. We will formalize in Proposition 4 that the
approximations of $J_k(\phi_k)$ and $\nabla_\phi J_k(\phi_k)$ do not affect the validity of (31) as a dual feasible
penalty for the discrete-time problem (i.e., condition (26) is satisfied).

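We describe the procedure of evaluating an applicable penalty function using simulation. We first generate a
realization of $(Z,\tilde Z)$ and consequently we can obtain $\phi_k = \phi_k(\phi_0,Z_{1:k},\tilde Z_{1:k})$,
$\sigma_k = \sigma(k,\phi_k)$, $\sigma_k^{\phi,1} = \sigma^{\phi,1}(k,\phi_k)$,
$\sigma_k^{\phi,2} = \sigma^{\phi,2}(k,\phi_k)$, $\hat J_k(\phi_k)$, and $\nabla_\phi\hat J_k(\phi_k)$; with an
admissible strategy $\hat a = (\hat a_0,\dots,\hat a_{K-1})$, we can also obtain the wealth $\hat W_k$ induced by
$\hat a$ along this realization via (19c) as an approximation to the wealth under the optimal policy. Then we can
approximate $M^*(a,Z,\tilde Z)$ by

$$M_1(a,Z,\tilde Z) \triangleq \sum_{k=0}^{K-1}\beta^{k\delta}\big[\Psi^1_{k+1}(a,Z,\tilde Z)
+ \Psi^2_{k+1}(a,Z,\tilde Z) + \Psi^3_{k+1}(a,Z,\tilde Z)\big], \tag{32}$$

where

$$\begin{aligned}
\Psi^1_{k+1}(a,Z,\tilde Z) &= \hat W_k^{1-\gamma}\,\nabla_\phi\hat J_k(\phi_k)^\top\sigma_k^{\phi,1}\sqrt{\delta}\,Z_{k+1}, \\
\Psi^2_{k+1}(a,Z,\tilde Z) &= \hat W_k^{1-\gamma}\,\nabla_\phi\hat J_k(\phi_k)^\top\sigma_k^{\phi,2}\sqrt{\delta}\,\tilde Z_{k+1}, \\
\Psi^3_{k+1}(a,Z,\tilde Z) &= (1-\gamma)\hat W_k^{-\gamma}\,\hat J_k(\phi_k)\,\Pi_k^\top\sigma_k\sqrt{\delta}\,Z_{k+1}.
\end{aligned}$$

Note that when a realization of $(Z,\tilde Z)$ is fixed, $\Psi^1_{k+1}$ and $\Psi^2_{k+1}$ are constants with
respect to $a$ (but vary across sample paths), and can be seen as control variates; $\Psi^3_{k+1}$ depends on
$\Pi_k$ (hence, on $a$), and thus is the only term that affects the solution of the inner optimization problem (28).
Since $\Psi^3_{k+1}$ is affine in $\Pi_k$, and since $C_k$ and $W_K$ are affine in $a$ by (28b) while $U$ is concave,
the objective function (28a) is jointly concave in $a$ with $M = M_1$. As a result, the inner optimization problem
(28) remains a convex optimization problem and can be easily solved. In our numerical experiments, we will consider
dual bounds generated by this penalty.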
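To make the evaluation of the penalty in (32) concrete, the following sketch computes it on one sample path. It is an illustration under stated assumptions rather than the authors' code: the market state is scalar ($m = 1$), the second shock is scalar, `W_hat` is the wealth path under some approximate policy, and `J_hat`, `dJ_hat` and the coefficient functions are hypothetical toy choices.

```python
import numpy as np

def penalty_m1(Pi, W_hat, phi, Z, Z_tilde, J_hat, dJ_hat, sigma, sig_phi1, sig_phi2,
               gamma, beta, delta):
    """Evaluate M_1 in (32) on one sample path (scalar market state).

    Pi: (K, n) candidate risky positions; W_hat: (K+1,) wealth under the approximate policy;
    phi: (K+1,) market states; Z: (K, n) and Z_tilde: (K,), row k holding the shocks of step k+1.
    """
    total, sq = 0.0, np.sqrt(delta)
    for k in range(len(Z)):
        grad = dJ_hat(k, phi[k])
        # psi1, psi2 do not involve the candidate controls; psi3 is linear in Pi[k].
        psi1 = W_hat[k] ** (1 - gamma) * grad * (sig_phi1(phi[k]) @ Z[k]) * sq
        psi2 = W_hat[k] ** (1 - gamma) * grad * sig_phi2(phi[k]) * Z_tilde[k] * sq
        psi3 = ((1 - gamma) * W_hat[k] ** (-gamma) * J_hat(k, phi[k])
                * (Pi[k] @ sigma(phi[k]) @ Z[k]) * sq)
        total += beta ** (k * delta) * (psi1 + psi2 + psi3)
    return total

# Toy ingredients (hypothetical, for illustration only).
n, gamma, beta, delta = 3, 3.0, 1.0, 0.1
sigma = lambda p: 0.2 * np.eye(n)
sig_phi1 = lambda p: np.full(n, -0.05)
sig_phi2 = lambda p: 0.1
J_hat = lambda k, p: -(1.0 + 0.1 * p ** 2)
dJ_hat = lambda k, p: -0.2 * p

rng = np.random.default_rng(0)
K = 3
phi = np.array([0.5, 0.4, 0.6, 0.3])
W_hat = np.array([1.0, 0.98, 1.01, 0.97])
Z, Z_tilde = rng.standard_normal((K, n)), rng.standard_normal(K)
Pi_a, Pi_b = np.full((K, n), 0.2), np.full((K, n), 0.1)
m_a = penalty_m1(Pi_a, W_hat, phi, Z, Z_tilde, J_hat, dJ_hat, sigma, sig_phi1, sig_phi2,
                 gamma, beta, delta)
```

With the path quantities held fixed, the returned value is affine in `Pi`, so subtracting it from a concave objective keeps the objective concave. In a one-period problem the penalty is also odd in the shocks, so an antithetic pair of shocks produces penalties that cancel exactly.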

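To find some variants of the penalties while still keeping the inner optimization problem convex, we also generate
$\bar\Psi^1_{k+1}$ based on a first-order Taylor expansion of $\Psi^1_{k+1}$ (with $W_k$ in place of $\hat W_k$)
around the strategy $\hat a_{k-1} = (\hat\Pi_{k-1},\hat C_{k-1})$:

$$\bar\Psi^1_{k+1}(a,Z,\tilde Z) = \Big[\hat W_k^{1-\gamma} + (1-\gamma)\hat W_k^{-\gamma}\big((R_k - R_f\mathbf{1}_n)^\top
(\Pi_{k-1} - \hat\Pi_{k-1}) - (C_{k-1} - \hat C_{k-1})\big)\Big]\cdot
\nabla_\phi\hat J_k(\phi_k)^\top\sigma_k^{\phi,1}\sqrt{\delta}\,Z_{k+1}, \tag{33}$$

where $R_k = R_k(\phi_0,Z_{1:k},\tilde Z_{1:k})$,
$\hat\Pi_{k-1} = \hat\Pi_{k-1}(\phi_0,W_0,Z_{1:k-1},\tilde Z_{1:k-1})$, and
$\hat C_{k-1} = \hat C_{k-1}(\phi_0,W_0,Z_{1:k-1},\tilde Z_{1:k-1})$. Then $\bar\Psi^1_{k+1}$ is affine in
$\Pi_{k-1}$ and $C_{k-1}$. We can also obtain a variant of $\Psi^2_{k+1}$, say $\bar\Psi^2_{k+1}$, in exactly the
same way. Since $\Psi^3_{k+1}$ is already linear in $\Pi_k$, we do not linearize it with respect to $a_{k-1}$. In
our numerical experiments we will also consider dual bounds generated by

$$M_2(a,Z,\tilde Z) \triangleq \sum_{k=0}^{K-1}\beta^{k\delta}\big[\bar\Psi^1_{k+1}(a,Z,\tilde Z)
+ \bar\Psi^2_{k+1}(a,Z,\tilde Z) + \Psi^3_{k+1}(a,Z,\tilde Z)\big]. \tag{34}$$

To go further, we can also generate a penalty function by linearizing $\Psi^3_{k+1}$ around
$(\hat a_k,\hat a_{k-1})$. Finally, we justify the validity of $M_1$ and $M_2$ as dual feasible penalties in
Proposition 4.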

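Proposition 4 The functions $M_1$ in (32) and $M_2$ in (34) are dual feasible penalties, i.e.,
$M_1, M_2 \in \mathcal{M}_{\mathbb{G}}(0)$. Hence, $\mathcal{L}M_1(\phi_0,W_0)$ and $\mathcal{L}M_2(\phi_0,W_0)$ are
upper bounds on the value function $H_0(\phi_0,W_0)$.

Proof. We observe that with a fixed non-anticipative policy $\hat a \in \mathbb{A}_{\mathbb{G}}$, the quantities
$\phi_k$, $\hat W_k$, $\hat J_k(\phi_k)$, $\nabla_\phi\hat J_k(\phi_k)$, $\sigma_k$, and $\sigma_k^{\phi,j}$,
$j = 1,2$, are $\mathcal{G}_k$-measurable for $k = 0,\dots,K-1$. We also note that $\Pi_k$ is
$\mathcal{G}_k$-measurable due to $a \in \mathbb{A}_{\mathbb{G}}$. Since $Z_{k+1}$ and $\tilde Z_{k+1}$ have zero
means and are independent of $\mathcal{G}_k$ and $(\phi_0,W_0)$, we have for any $(\phi_0,W_0)$,

$$\mathbb{E}[\Psi^i_{k+1}(a,Z,\tilde Z) \mid \phi_0,W_0] = 0 \quad \text{for all } a \in \mathbb{A}_{\mathbb{G}},$$

for $i = 1,2,3$. So $\mathbb{E}[M_1(a,Z,\tilde Z) \mid \phi_0,W_0] = 0$ for all $a \in \mathbb{A}_{\mathbb{G}}$, and
hence $M_1 \in \mathcal{M}_{\mathbb{G}}(0)$. Since the same argument applies to $\bar\Psi^i_{k+1}(a,Z,\tilde Z)$ for
$i = 1,2$, it can be concluded that $M_2 \in \mathcal{M}_{\mathbb{G}}(0)$. ■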
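The zero-mean step in the proof is an application of the tower property. As a worked instance for the third term, and assuming the $\mathcal{G}_k$-measurable factor is integrable, write $\hat W_k$ and $\hat J_k$ for the wealth under the approximate policy and the approximate value function:

```latex
\begin{aligned}
\mathbb{E}\big[\Psi^3_{k+1} \mid \phi_0,W_0\big]
 &= \mathbb{E}\Big[\mathbb{E}\big[(1-\gamma)\hat W_k^{-\gamma}\hat J_k(\phi_k)\,\Pi_k^\top\sigma_k\sqrt{\delta}\,Z_{k+1}
    \,\big|\,\mathcal{G}_k\big] \,\Big|\, \phi_0,W_0\Big] \\
 &= \mathbb{E}\Big[(1-\gamma)\hat W_k^{-\gamma}\hat J_k(\phi_k)\,\Pi_k^\top\sigma_k\sqrt{\delta}\;
    \mathbb{E}\big[Z_{k+1} \mid \mathcal{G}_k\big] \,\Big|\, \phi_0,W_0\Big] = 0 .
\end{aligned}
```

The argument uses only that the coefficient multiplying $Z_{k+1}$ is known at time $k$; it does not require $\hat J_k$ to be close to $J_k$, which is why the approximation error does not affect dual feasibility.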
The penalty function in the form of (32) or (34) has several advantages: first, it can be evaluated without
computing any conditional expectation, so a substantial amount of computation can be avoided; second, the design
of the penalty function is quite flexible: we can use any admissible policy to obtain a valid penalty, and we can
choose to linearize around this policy, which makes the inner optimization problem (28) convex and
computationally tractable.



C. Numerical Experiments 

In this section we discuss the use of Monte Carlo simulation to evaluate the performance of the suboptimal
policies and the dual bounds on the expected utility (21). We consider a model with three risky assets ($n = 3$)
and one market state variable ($m = 1$). We choose $T = 1$ year and $\delta = 0.1$ year in our numerical
experiments. In addition, we use $\alpha = 0.5$ for the weight of the intermediate utility function and use
$\beta = 1$ as the discount factor. Other information on the state equation (19) can be found in Appendix E. In
particular, the market state variable $\{\phi_k\}$ follows a mean-reverting Ornstein-Uhlenbeck process: it has
relatively small mean reversion rate and volatility in parameter sets 1 and 3, while it has relatively large mean
reversion rate and volatility in parameter sets 2 and 4. We assume $\phi_0 = 0$ and $W_0 = 1$ as the initial
condition and impose the constraint (20) on the control space $\mathcal{A}$ in the following numerical tests.

For each parameter set we first use the discrete state-space approximation method to solve the recursion (25). In
particular, we approximate the market state variable $\phi_k$ using a grid with 21 equally spaced points from $-2$
to $2$, and the transition between these grid points is determined by (19a), noting that
$\phi_{k+1} \sim N\big(\phi_k + \mu_k^{\phi}\delta,\ (\|\sigma_k^{\phi,1}\|^2 + \|\sigma_k^{\phi,2}\|^2)\delta\big)$;
the random variables $Z_k$ and $\tilde Z_k$ are approximated by the Gaussian quadrature method with 3 points for
each dimension (see, e.g., [41]). So the joint distribution of the market state and the returns is approximated by a
total of $3^3 \times 21 = 567$ grid points, which are used to compute the conditional expectation in (25): we
assume $\phi_{k+1}$ and $R_{k+1}$ are independent conditioned on $\phi_k$, so that the conditional expectation
reduces to a finite weighted sum. For the optimization problem in (25) we use CVX ([42]), a package for solving
convex optimization problems in Matlab, to determine the optimal consumption and investment policy at each grid
point of $\phi_k$ at time $k$. We record the value function and the corresponding policy on this grid at each time
$k = 0,\dots,K$. Note that the market state variable $\phi_k$ is one dimensional, so the value function and the
policy can be naturally extended to market states $\phi_k$ outside the grid by piecewise linear interpolation. In
our numerical implementation the extended value function and the extended policy play the roles of the approximate
value function $\hat J_k(\phi_k)$ and the approximate policy $\hat a$ for the discrete-time problem (19)-(21); and
we take the slope of the piecewise linear function $\hat J_k(\phi)$ as $\nabla_\phi\hat J_k(\phi)$ if $\phi$ is
between the grid points; otherwise, we can use the average slope of the two adjacent line segments as
$\nabla_\phi\hat J_k(\phi)$.
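The quadrature step can be illustrated with a tensor-product Gauss-Hermite rule. The sketch below is one plausible way to build a 3-point-per-dimension rule for a standard Gaussian vector using NumPy; whether the experiments used exactly this rule is an assumption, and the function name `gaussian_product_rule` is hypothetical.

```python
import itertools
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_product_rule(dim, points=3):
    """Nodes and weights approximating a standard Gaussian vector in R^dim."""
    x, w = hermegauss(points)       # Gauss-Hermite rule for the weight exp(-x^2 / 2)
    w = w / w.sum()                 # normalise so the weights form a probability vector
    nodes = np.array(list(itertools.product(x, repeat=dim)))
    weights = np.array([np.prod(ws) for ws in itertools.product(w, repeat=dim)])
    return nodes, weights

nodes, weights = gaussian_product_rule(dim=3)    # 3 risky assets, 3 points each

def expect(f):
    """Approximate E[f(Z)] for Z ~ N(0, I_3) by a finite weighted sum."""
    return float(sum(wt * f(z) for z, wt in zip(nodes, weights)))

second_moment = expect(lambda z: z[0] ** 2)
```

With three points per dimension the one-dimensional rule has nodes $-\sqrt{3}, 0, \sqrt{3}$ and weights $1/6, 2/3, 1/6$, so it integrates polynomials up to degree five exactly; the product rule for three dimensions has $3^3 = 27$ nodes, which combined with 21 grid points for the market state gives the 567 points mentioned above.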

We then repeatedly generate random sequences of $(Z,\tilde Z)$, based on which we generate the sequences of market
states and returns according to their joint probability distribution (19)-(21). Then we apply the aforementioned
policy $\hat a$ on these sequences to get an estimate of the lower bound on the value function $H_0$; based on each
random sequence we can also solve the inner optimization problem (28) with penalty $M_1$ in (32) or $M_2$ in (34),
which leads to an estimate of the upper bound on $H_0$. We present our numerical results in the following tables:
the lower bound, which is referred to as "Lower Bound", is obtained by generating 100 random sequences of
$(Z,\tilde Z)$ and their antithetic pairs (see [43] for an introduction to antithetic variates) in a single run,
with 10 runs in total; the upper bounds induced by penalties $M_1$ and $M_2$, which are referred to as "Dual Bound
1" and "Dual Bound 2" respectively, are obtained by generating 30 random sequences of $(Z,\tilde Z)$ and their
antithetic pairs in a single run, with 10 runs in total. To see the effectiveness of the proposed penalties, we use
the zero penalty and repeat the same procedure to compute the upper bounds that are referred to as "Zero Penalty"
in the tables. These bounds on the value function $H_0$ (i.e., the expected utility) are reported in the sub-column
"Value", where each entry shows the sample average and the standard error (in parentheses) of the 10 independent
runs. We also list the certainty equivalent of the expected utility in the sub-column "CE" (this is reported in the
literature such as [33]), where "CE" is defined through $U(\mathrm{CE}) = \text{Value}$. For ease of comparison, we
compute the duality gap, i.e., the smaller of the differences between the lower bound and the two upper bounds, on
the expected utility and on its certainty equivalent, as a fraction of the lower bound in the column "Duality Gap".
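The sampling scheme described above (antithetic pairs within a run, then a sample average and standard error across independent runs) can be sketched as follows. This is a generic illustration: `path_value` stands for any function mapping a shock sequence to a path-wise quantity (for example, the utility under a policy, or the optimal value of the inner problem), and taking the standard error as the sample standard deviation of the run averages divided by the square root of the number of runs is an assumption about the reporting convention.

```python
import numpy as np

def run_estimate(path_value, n_pairs, dim, rng):
    """Average of path_value over n_pairs shock sequences and their antithetic mirrors."""
    shocks = rng.standard_normal((n_pairs, dim))
    pair_means = [0.5 * (path_value(z) + path_value(-z)) for z in shocks]
    return float(np.mean(pair_means))

def mean_and_standard_error(path_value, n_runs, n_pairs, dim, seed=0):
    """Sample average and standard error across independent runs."""
    rng = np.random.default_rng(seed)
    runs = np.array([run_estimate(path_value, n_pairs, dim, rng) for _ in range(n_runs)])
    return runs.mean(), runs.std(ddof=1) / np.sqrt(n_runs)

# Toy path-wise quantities (hypothetical): one linear and one nonlinear in the shocks.
linear = lambda z: 2.0 + z.sum()
nonlinear = lambda z: -np.exp(-0.1 * z.sum())

lin_mean, lin_se = mean_and_standard_error(linear, n_runs=10, n_pairs=100, dim=40)
nl_mean, nl_se = mean_and_standard_error(nonlinear, n_runs=10, n_pairs=100, dim=40)
```

For a quantity that is linear in the shocks the antithetic pair removes the noise entirely, which is why penalties that are linear in $(Z,\tilde Z)$ pair well with this variance reduction technique; for nonlinear quantities the reduction is only partial.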

TABLE I
Results with Parameter Set 1

      Lower Bound           Dual Bound 1          Dual Bound 2          Zero Penalty          Duality Gap
γ     Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value    CE
1.5   -5.480     1.332      -5.391     1.376      -5.392     1.376      -4.861     1.693      1.61%    3.30%
      (0.003)    (0.001)    (0.008)    (0.004)    (0.007)    (0.004)    (0.012)    (0.008)
3.0   -42.887    1.080      -39.227    1.129      -39.873    1.120      -27.562    1.347      7.53%    3.70%
      (0.036)    (0.001)    (0.164)    (0.002)    (0.317)    (0.004)    (0.252)    (0.006)
5.0   -2445.9    1.005      -2066.5    1.049      -2025.5    1.054      -1105.7    1.226      15.51%   4.38%
      (1.635)    (0.001)    (22.019)   (0.003)    (17.833)   (0.002)    (16.438)   (0.004)


TABLE II
Results with Parameter Set 2

      Lower Bound           Dual Bound 1          Dual Bound 2          Zero Penalty          Duality Gap
γ     Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value    CE
1.5   -5.466     1.339      -5.380     1.382      -5.381     1.381      -4.864     1.691      1.56%    3.14%
      (0.005)    (0.001)    (0.011)    (0.006)    (0.015)    (0.008)    (0.020)    (0.008)
3.0   -42.585    1.084      -39.645    1.123      -39.690    1.122      -27.708    1.343      6.80%    3.51%
      (0.081)    (0.001)    (0.229)    (0.003)    (0.155)    (0.002)    (0.209)    (0.005)
5.0   -2431.6    1.007      -2043.8    1.052      -2040.7    1.052      -1122.1    1.222      15.95%   4.47%
      (7.510)    (0.001)    (11.881)   (0.002)    (19.882)   (0.003)    (9.842)    (0.004)


We consider utility functions with different relative risk aversion coefficients $\gamma = 1.5$, $3.0$, and $5.0$,
which reflect low, medium and high degrees of risk aversion. The dual bounds induced by the zero penalty perform
poorly, as expected. On the other hand, it is hard to distinguish the performance of "Dual Bound 1" and "Dual Bound
2", which may imply that $\Psi^3_{k+1}$ plays an essential role in the inner optimization problem in making the
dual bounds tight in this problem. We observe that the duality gaps on the value function $H_0$ are generally
smaller when $\gamma$ is small, implying that both the approximate policy and the penalties are near optimal. For
example, when $\gamma = 1.5$, the duality gaps are within 2% of the optimal expected utility for all sets of
parameters. As $\gamma$ increases, the duality gaps generally become larger.



TABLE III
Results with Parameter Set 3

      Lower Bound           Dual Bound 1          Dual Bound 2          Zero Penalty          Duality Gap
γ     Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value    CE
1.5   -5.439     1.352      -5.376     1.384      -5.376     1.384      -4.904     1.663      1.16%    2.37%
      (0.002)    (0.001)    (0.008)    (0.004)    (0.007)    (0.003)    (0.014)    (0.009)
3.0   -41.961    1.092      -38.221    1.144      -38.800    1.135      -28.5874   1.322      7.53%    3.94%
      (0.074)    (0.001)    (0.243)    (0.003)    (0.241)    (0.003)    (0.168)    (0.004)
5.0   -2402.6    1.005      -1997.3    1.049      -2011.5    1.056      -1157.3    1.212      16.28%   4.38%
      (7.243)    (0.001)    (16.467)   (0.002)    (15.288)   (0.002)    (14.836)   (0.004)

TABLE IV
Results with Parameter Set 4

      Lower Bound           Dual Bound 1          Dual Bound 2          Zero Penalty          Duality Gap
γ     Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value      CE(10^-1)  Value    CE
1.5   -5.441     1.351      -5.384     1.380      -5.375     1.385      -4.895     1.669      1.05%    2.15%
      (0.001)    (0.001)    (0.004)    (0.002)    (0.008)    (0.004)    (0.008)    (0.005)
3.0   -41.933    1.092      -38.432    1.141      -38.664    1.137      -28.661    1.321      7.80%    4.12%
      (0.088)    (0.001)    (0.380)    (0.006)    (0.218)    (0.003)    (0.127)    (0.003)
5.0   -2394.2    1.011      -1984.6    1.059      -2005.9    1.057      -1135.2    1.218      16.22%   4.55%
      (5.130)    (0.001)    (12.951)   (0.002)    (13.758)   (0.002)    (14.353)   (0.004)


There are several reasons for the enlarged duality gaps on the value function with increasing $\gamma$. Note
that the utility function $U(x)$ is a power function of $x$ (with the negative power $1-\gamma$) and it decreases at
a higher rate with larger $\gamma$ as $x$ approaches zero. This is reflected by the fact that both the lower and
upper bounds on the value function $H_0$ decrease rapidly with higher values of $\gamma$. In the case of evaluating
the upper bounds on $H_0$, it can be inferred that with larger $\gamma$ the objective value (28a) is more sensitive
to the solution of the inner optimization problem (28), and hence to the quality of the penalty functions. In other
words, even a small distortion of the optimal penalty will lead to a significant deviation of the dual bound. In
our case the heuristic penalty is derived by discretizing the value function-based penalty for the continuous-time
problem; however, this penalty may become far from optimal for the discrete-time problem when $\gamma$ increases.
Similarly, obtaining tight lower bounds on the expected utility by simulation under a sub-optimal policy suffers
from the same problem, that is, a sub-optimal policy obtained from the same approximation scheme of the recursion
(25) may cause more utility loss with larger $\gamma$. The performance of the sub-optimal policy also influences the
quality of the penalty function, since the penalties $M_1$ and $M_2$ involve the wealth $\hat W_k$ induced by the
sub-optimal policy, and its error compared with the wealth under the optimal policy accumulates over time. Hence,
both sub-optimal policies and sub-optimal penalties contribute to the increasing duality gaps on the value function
with larger risk aversion coefficients.



These numerical results offer some computational guidance for applying the dual approach: the penalty function should be designed with more care if the objective value of the inner optimization problem is numerically sensitive either to its optimal solution or to the choice of the penalty function. Fortunately, the sensitivity of the expected utility with respect to $\gamma$ in this problem is relieved to some extent by considering its certainty equivalent. We can see from the table that the differences between the lower bounds and the upper bounds in terms of "CE" stay within a relatively constant range for different values of $\gamma$.
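As a small illustration of how the gap and the certainty equivalent are related, the sketch below recomputes the two percentages of the $\gamma = 5.0$ row from the bounds printed in the table. It assumes that the reported gap is the difference between an upper and a lower bound relative to the lower bound, which is what reproduces the printed 16.22% and 4.55% from the pairs (-2394.2, -2005.9) and (1.011, 1.057). The CRRA form $U(x) = x^{1-\gamma}/(1-\gamma)$ and the function names are our assumptions, not taken from the paper.

```python
def relative_gap(lower: float, upper: float) -> float:
    """Gap between an upper and a lower bound, relative to |lower|."""
    return (upper - lower) / abs(lower)


def crra_utility(wealth: float, gamma: float) -> float:
    """Assumed CRRA form U(x) = x**(1 - gamma) / (1 - gamma), gamma != 1."""
    return wealth ** (1.0 - gamma) / (1.0 - gamma)


def certainty_equivalent(utility: float, gamma: float) -> float:
    """Inverse of crra_utility: the sure wealth whose utility is `utility`."""
    return ((1.0 - gamma) * utility) ** (1.0 / (1.0 - gamma))


# Bounds copied from the gamma = 5.0 row of the table.
value_gap = relative_gap(-2394.2, -2005.9)   # gap on the value function
ce_gap = relative_gap(1.011, 1.057)          # gap on the certainty equivalent
print(f"{100 * value_gap:.2f}%  {100 * ce_gap:.2f}%")
```

Because the certainty equivalent inverts a steep power function, a large relative gap in utility terms maps to a much smaller relative gap in wealth terms, which is the effect described above.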

IV. Conclusion 

In this paper we study the dual formulation of controlled Markov diffusions by means of information relaxation. This dual formulation provides new insights into seeking the value function: if we can find an optimal solution to the dual problem, i.e., an optimal penalty, then the value function can be recovered without solving an HJB equation. From a more practical point of view, this dual formulation can be used to find a dual bound on the value function. We explore the structure of the value function-based optimal penalty, which provides the theoretical basis for developing near-optimal penalties that lead to tight dual bounds. As in the case of MDPs, if we compare the dual bound on the value function of a controlled Markov diffusion with the lower bound generated by Monte Carlo simulation under a sub-optimal policy, the duality gap indicates how well the sub-optimal policy performs and how much we can improve on our current policy. Furthermore, we expose the connection of the gradient-based optimal penalty between controlled Markov diffusions and MDPs in the Appendix.

We carried out numerical studies in a dynamic portfolio choice problem that is discretized from a continuous-time model. To derive tight dual bounds on the expected utility, we proposed a class of penalties that can be viewed as discretizing the value function-based optimal penalty of the continuous-time problem, and these new penalties make the inner optimization problem computationally tractable. This approach has potential use in many other interesting applications where the system dynamics are modeled as a controlled Markov diffusion. Moreover, by examining the duality gaps on the expected utility with different parameters, we find that the objective function in the primal problem may largely influence the sensitivity of the optimal solution to the dual problem, and hence the quality of the dual bounds. These numerical studies complement the existing examples of applying the dual approach to continuous-state MDPs.

This dual approach also sheds light on some future directions. First, we intend to study more practical methods for applying the dual approach to general (continuous-state) MDPs. For example, a new type of gradient-based penalty has been presented in [44]. Second, we would like to formulate the dual representation of other continuous-time controlled Markov processes. An analogue of Proposition 2 or Proposition 3 may be established as long as the evolution of the value function under the state dynamics can be explicitly represented; if the value function-based penalty admits a simple structure (under the natural filtration), as it does in the setting of controlled Markov diffusions, it may generate dual bounds at low computational cost.
Appendix A

Technical Assumptions on Controlled Markov Diffusions

In integral form, (6) can be written as

$$x_t = x_0 + \int_0^t b(s,x_s,u_s)\,ds + \int_0^t \sigma(s,x_s,u_s)\,dw_s, \quad 0 \le t \le T. \qquad (35)$$

The last term on the right side of (35) is an Itô stochastic integral. We say that a process $(x_t)_{0 \le t \le T}$ is a pathwise solution of (6) or (35) if it is $\mathbb{F}$-progressively measurable and has continuous sample paths almost surely (a.s.) given $x_0 = x \in \mathbb{R}^n$.

To guarantee a unique pathwise solution $(x_t)_{t \in [0,T]}$ to (6) or (35) when $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$, we impose the following standard assumptions:

Assumption 1 $b$ and $\sigma$ are continuous on $Q \times \mathcal{U}$, and for some constants $C_1, C_2 > 0$,

1) $\|b(t,x,u)\| + \|\sigma(t,x,u)\| \le C_1(1 + \|x\| + \|u\|)$ for all $(t,x,u) \in Q \times \mathcal{U}$;

2) $\|b(t,x,u) - b(s,y,u)\| + \|\sigma(t,x,u) - \sigma(s,y,u)\| \le C_2(|t-s| + \|x-y\|)$ for all $(t,x), (s,y) \in Q$ and $u \in \mathcal{U}$.

To ensure that Theorem 2 holds, we also assume that $\Lambda$ and $g$ satisfy the following polynomial growth conditions:

Assumption 2 For some constants $C_\Lambda, c_\Lambda, C_g, c_g > 0$,

1) $|\Lambda(x)| \le C_\Lambda(1 + \|x\|^{c_\Lambda})$ for all $x \in \mathbb{R}^n$;

2) $|g(t,x,u)| \le C_g(1 + \|x\|^{c_g} + \|u\|^{c_g})$ for all $(t,x,u) \in Q \times \mathcal{U}$.

Appendix B

Proof of Theorem 3

We note that the left side of (12) is the definition of $V(0,x)$. By the weak duality result, the left side of (12) is less than or equal to the right side. We only need to show that with $h = h^*$ in (11), the right side of (12) is equal to the left side. This is done by the argument preceding the statement of Theorem 3.

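Appendix C
Proof of Theorem 4

We first consider sufficiency. Let $\mathbf{u}^* \in \mathcal{U}_{\mathbb{F}}(0)$ and $h^* \in \mathcal{M}_{\mathbb{F}}(0)$. We assume $\mathbb{E}_{0,x}[h^*(\mathbf{u}^*,\mathbf{w})] = 0$ and (13) holds. Then by weak duality, $\mathbf{u}^*$ and $h^*$ should be optimal to the primal and dual problem, respectively.

Next we consider necessity. Let $\mathbf{u}^* \in \mathcal{U}_{\mathbb{F}}(0)$ and $h^* \in \mathcal{M}_{\mathbb{F}}(0)$. Then we have

$$\mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h^*(\mathbf{u},\mathbf{w})\Big\}\Big] \ge \mathbb{E}_{0,x}\Big[\Lambda(x_T^*) + \int_0^T g(t,x_t^*,u_t^*)\,dt - h^*(\mathbf{u}^*,\mathbf{w})\Big] = J(0,x;\mathbf{u}^*) - \mathbb{E}_{0,x}[h^*(\mathbf{u}^*,\mathbf{w})] \ge J(0,x;\mathbf{u}^*).$$

The last inequality holds due to $h^* \in \mathcal{M}_{\mathbb{F}}(0)$. Since we know $\mathbf{u}^*$ and $h^*$ are optimal to the primal and dual problem respectively, by the strong duality result, we have

$$\mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h^*(\mathbf{u},\mathbf{w})\Big\}\Big] = J(0,x;\mathbf{u}^*),$$

which implies all the inequalities above are equalities. Therefore, we know $\mathbb{E}_{0,x}[h^*(\mathbf{u}^*,\mathbf{w})] = 0$ and (13) holds.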



Appendix D

Complement of Section 2.1

In this section we aim to develop the value function-based penalty as a solution to the dual problem on the right side of (12), which can be viewed as the counterpart of (5) in the setting of controlled Markov diffusions. For this purpose we introduce the anticipating stochastic calculus and anticipating stochastic differential equation in Appendix D-A, and present the value function-based optimal penalty in Appendix D-B. In Appendix D-C, we review a Lagrangian approach proposed by [20], which falls in our dual framework of controlled Markov diffusions; the Lagrangian term they proposed satisfies the complementary slackness condition developed in Theorem 4, and hence it behaves like a "penalty" function. Moreover, we compare the procedure of deriving this Lagrangian term with that of the gradient-based penalty proposed by [9] for MDPs, in order to expose their similarities.

Throughout this section we assume that $\sigma$ in (6) depends only on $t$ and $x$. To be specific,

$$dx_t = b(t,x_t,u_t)\,dt + \sigma(t,x_t)\,dw_t, \quad 0 \le t \le T, \quad x_0 = x, \qquad (36)$$

where $b : [0,T] \times \mathbb{R}^n \times \mathcal{U} \to \mathbb{R}^n$ and $\sigma : [0,T] \times \mathbb{R}^n \to \mathbb{R}^{n \times m}$. Besides the technical assumptions in Appendix A, we further assume that the gradient of $\sigma(t,x)$ with respect to $x$ exists and is continuous and bounded on $Q$. The reason to suppress the dependence of $\sigma$ on $u$ is to extend the definition of a solution to (36) to anticipative controls. When the stochastic integral is defined in the Itô sense, SDE (36) has a well-defined solution provided that the control strategy $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$; however, if an anticipative control strategy $\mathbf{u} \in \mathcal{U}(0)$ is considered, we need to first define a stochastic integral with respect to an anticipative process and then define a solution to an anticipating stochastic differential equation (see, e.g., [19], [18], [20]). This solution is an extension of that to the (regular) SDE in the Itô sense, i.e., it should coincide with the solution to the SDE in the Itô sense when $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$. We follow [20] to define such a solution using the decomposition technique, and for this purpose we require that $\sigma$ is a function of only $t$ and $x$.

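The reward functional to be maximized and the value function are the same as in (7):

$$V(t,x) = \sup_{\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(t)} J(t,x;\mathbf{u}), \qquad (37)$$

where

$$J(t,x;\mathbf{u}) = \mathbb{E}_{t,x}\Big[\Lambda(x_T) + \int_t^T g(s,x_s,u_s)\,ds\Big].$$

The partial differential operator $A^u$ is then redefined as

$$A^u L(t,x) = L_t(t,x) + L_x(t,x)^\top b(t,x,u) + \tfrac{1}{2}\,\mathrm{tr}\big(L_{xx}(t,x)(\sigma\sigma^\top)(t,x)\big), \quad L \in C^{1,2}(Q).$$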

A. Anticipating Stochastic Calculus and Anticipating Stochastic Differential Equation 

There are several ways to integrate stochastic processes that are not adapted to the Brownian motion, such as the Skorohod and (generalized) Stratonovich integrals (see, e.g., [19], [18]). In this subsection we present the Stratonovich integral and its associated Itô formula. Then we define the solution to the anticipating stochastic differential equation in the Stratonovich sense.

We first assume that $\mathbf{w} = (w_t)_{t \in [0,T]}$ is a one-dimensional Brownian motion in the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We denote by $I$ an arbitrary partition of the interval $[0,T]$ of the form $I = \{0 = t_0 < t_1 < \cdots < t_n = T\}$ and define $|I| = \sup_{0 \le i \le n-1}(t_{i+1} - t_i)$.

Definition 2 (Definition 3.1.1 in [19]) We say that a measurable process $y = (y_t)_{t \in [0,T]}$ such that $\int_0^T |y_t|\,dt < \infty$ a.s. is Stratonovich integrable if the family

$$S^I = \int_0^T y_t \sum_{i=0}^{n-1} \frac{w_{t_{i+1}} - w_{t_i}}{t_{i+1} - t_i}\,\mathbf{1}_{(t_i,t_{i+1}]}(t)\,dt$$

converges in probability as $|I| \to 0$, and in this case the limit will be denoted by $\int_0^T y_t \circ dw_t$.

Remark 2 We can translate an Itô integral to a Stratonovich integral and vice versa. Suppose $y = (y_t)_{t \in [0,T]}$ is a continuous semimartingale of the form

$$y_t = y_0 + \int_0^t v_s\,ds + \int_0^t \zeta_s\,dw_s,$$

where $(v_t)_{t \in [0,T]}$ and $(\zeta_t)_{t \in [0,T]}$ are adapted processes taking values in $\mathbb{R}^n$ and $\mathbb{R}^{n \times m}$ such that $\int_0^T \|v_s\|\,ds < \infty$ and $\int_0^T \|\zeta_s\|^2\,ds < \infty$ a.s. Then $y$ is Stratonovich integrable on any interval $[0,t]$, and

$$\int_0^t y_s \circ dw_s = \int_0^t y_s\,dw_s + \frac{1}{2}\langle y, w\rangle_t = \int_0^t y_s\,dw_s + \frac{1}{2}\int_0^t \zeta_s\,ds, \qquad (38)$$

where $\langle y, w\rangle_t$ denotes the joint quadratic variation of the semimartingale $y$ and the Brownian motion $\mathbf{w}$. Definition 2 and the equality (38) can be naturally extended to the vector case.
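The conversion (38) can be checked numerically in the simplest case $y_t = w_t$, for which $\zeta_t = 1$ and the correction term is $t/2$. The sketch below is only an illustration: it approximates the Itô integral by left-endpoint sums and the Stratonovich integral by endpoint-average (trapezoidal) sums, a common discretization that has the same limit for continuous semimartingales and is used here in place of the time averages in Definition 2. The function name, step count and seed are our choices.

```python
import math
import random


def ito_and_stratonovich_sums(n_steps: int, horizon: float, seed: int):
    """Approximate int_0^T w dw on one simulated Brownian path.

    Returns (ito, stratonovich, w_T): the left-endpoint sum, the
    endpoint-average sum, and the terminal value of the path.
    """
    rng = random.Random(seed)
    dt = horizon / n_steps
    w = 0.0
    ito = 0.0
    strat = 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))
        ito += w * dw                  # integrand at the left endpoint
        strat += (w + 0.5 * dw) * dw   # average of the two endpoints
        w += dw
    return ito, strat, w


T = 1.0
ito, strat, w_T = ito_and_stratonovich_sums(20_000, T, seed=0)
print(strat - 0.5 * w_T ** 2)   # ordinary chain rule: zero up to rounding
print(strat - ito, 0.5 * T)     # the correction (1/2) * int_0^T zeta_s ds
```

The first printed number reflects that the Stratonovich integral follows the ordinary chain rule, while the second line compares the simulated correction with $T/2$.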

Then we present the Itô formula for the Stratonovich integral in Proposition 5, the details of which can be found in Section 3.2.3 of [19].

Proposition 5 (Theorem 3.2.6 in [19]) Let $\mathbf{w} = (w_t^1, \ldots, w_t^m)_{t \in [0,T]}$ be an $m$-dimensional Brownian motion. Suppose that $y_0 \in \mathbb{D}^{1,2}$, $v_s \in \mathbb{L}^{1,2}$, and $\zeta^i \in \mathbb{L}_S^{2,4}$, $i = 1, \ldots, m$. Consider a process $y = (y_t)_{t \in [0,T]}$ of the form

$$y_t = y_0 + \int_0^t v_s\,ds + \sum_{i=1}^m \int_0^t \zeta_s^i \circ dw_s^i, \quad 0 \le t \le T.$$

Assume that $(y_t)_{0 \le t \le T}$ has continuous paths. Let $F : \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function. Then we have

$$F(y_t) = F(y_0) + \int_0^t F_y(y_s)^\top v_s\,ds + \sum_{i=1}^m \int_0^t \big[F_y(y_s)^\top \zeta_s^i\big] \circ dw_s^i, \quad 0 \le t \le T, \qquad (39)$$

where $F_y(\cdot)$ denotes the gradient of $F$ w.r.t. $y$.

Proposition 5 essentially says that the Stratonovich integral obeys the ordinary chain rule.

Based on the definition of the Stratonovich integral and Remark 2, we generalize SDE (36) to the Stratonovich sense (referred to as S-SDE) by letting $y_t = \sigma^i(t,x_t)$. Then (36) is equivalent to

$$x_t = x + \int_0^t \bar{b}(s,x_s,u_s)\,ds + \sum_{i=1}^m \int_0^t \sigma^i(s,x_s) \circ dw_s^i, \quad 0 \le t \le T, \qquad (40)$$

where $\sigma^i : [0,T] \times \mathbb{R}^n \to \mathbb{R}^n$ is the $i$-th column of $\sigma$, $i = 1, \ldots, m$, and $\bar{b}(t,x,u) = b(t,x,u) - \frac{1}{2}\sum_{i=1}^m \sigma_x^i\sigma^i(t,x)$. Here $\sigma_x^i\sigma^i(t,x)$ denotes an $n \times 1$ vector with $\sum_{j=1}^n \frac{\partial \sigma^{ki}}{\partial x_j}(t,x)\,\sigma^{ji}(t,x)$ being its $k$-th entry, and $\sigma^{ki}(\cdot)$ is the $k$-th component of $\sigma^i(\cdot)$. Since the stochastic integral in (40) is in the Stratonovich sense, S-SDE (40) admits its solution in the space of measurable processes, which may not be adapted to the filtration generated by the Brownian motion. Therefore, we are allowed to consider anticipative policies $\mathbf{u} \in \mathcal{U}(0)$ in (40).
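To make the drift correction in (40) concrete, the following sketch treats a scalar example with $b(x) = \mu x$ and $\sigma(x) = s x$ (geometric Brownian motion with no control), so that $\bar{b}(x) = (\mu - s^2/2)x$. It integrates the Itô form with the Euler-Maruyama scheme and the Stratonovich form with a Heun (predictor-corrector) scheme on the same Brownian path, and compares both with the known closed-form solution. The coefficients, the choice of schemes and all names are our assumptions for illustration only.

```python
import math
import random

MU, S = 0.08, 0.3   # hypothetical drift and volatility coefficients


def b(x):           # Ito drift b(x) = MU * x
    return MU * x


def sigma(x):       # diffusion coefficient sigma(x) = S * x
    return S * x


def sigma_x(x):     # derivative of sigma with respect to x
    return S


def b_bar(x):       # Stratonovich drift: b - (1/2) * sigma_x * sigma
    return b(x) - 0.5 * sigma_x(x) * sigma(x)


def simulate(x0, horizon, n_steps, seed):
    """Return (Euler-Maruyama for the Ito SDE, Heun for the S-SDE, exact)."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    x_ito, x_str, w = x0, x0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))
        # Ito form: drift b, integrand evaluated at the left endpoint.
        x_ito += b(x_ito) * dt + sigma(x_ito) * dw
        # Stratonovich form: drift b_bar, Heun predictor-corrector step.
        pred = x_str + b_bar(x_str) * dt + sigma(x_str) * dw
        x_str += (0.5 * (b_bar(x_str) + b_bar(pred)) * dt
                  + 0.5 * (sigma(x_str) + sigma(pred)) * dw)
        w += dw
    exact = x0 * math.exp((MU - 0.5 * S ** 2) * horizon + S * w)
    return x_ito, x_str, exact


x_ito, x_str, exact = simulate(x0=1.0, horizon=1.0, n_steps=4000, seed=1)
print(x_ito, x_str, exact)
```

Both schemes approximate the same path, which illustrates that (36) with drift $b$ and (40) with drift $\bar{b}$ describe the same process when the control is adapted.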

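Finally, we need to ensure the existence of a solution to S-SDE (40) if the control strategy $\mathbf{u} \in \mathcal{U}(0)$ is anticipative. Following [20], [18], we have a representation of such a solution using the decomposition technique:

$$x_t = \xi_t(\eta_t), \qquad (41)$$

where $\{\xi_t(x)\}_{t \in [0,T]}$ denotes the stochastic flow defined by the adapted equation

$$d\xi_t = \sum_{i=1}^m \sigma^i(t,\xi_t) \circ dw_t^i = \frac{1}{2}\sum_{i=1}^m \sigma_x^i\sigma^i(t,\xi_t)\,dt + \sigma(t,\xi_t)\,dw_t, \quad \xi_0 = x, \qquad (42)$$

and $(\eta_t)_{t \in [0,T]}$ solves an ordinary differential equation:

$$\frac{d\eta_t}{dt} = \Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta_t)\,\bar{b}(t,\xi_t(\eta_t),u_t), \quad \eta_0 = x, \qquad (43)$$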

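where $\frac{\partial \xi_t}{\partial x}$ denotes the $n \times n$ Jacobian matrix of $\xi_t$ with respect to $x$. Under some technical conditions (see Section 1 of [20]), the solution (41) is defined almost surely: note that $\xi_t$ does not depend on the control $u_t$, i.e., it is the solution to a regular SDE in the Itô sense; $\eta_t$ is not defined by a stochastic integral, so it is the solution to an ordinary differential equation parameterized by $\mathbf{w}$ (note that $\big(\frac{\partial \xi_t}{\partial x}\big)^{-1}$ is well-defined a.s. for $(t,x) \in [0,T] \times \mathbb{R}^n$, because $\xi_t(x)$ is a flow of diffeomorphisms a.s.). Hence, $x_t = \xi_t(\eta_t)$ is well-defined regardless of the adaptiveness of $\mathbf{u} = (u_t)_{0 \le t \le T}$. To check that $x_t = \xi_t(\eta_t)$ satisfies (36), we need to employ a generalized Itô formula of (39) for the Stratonovich integral (see Theorem 4.1 in [18]).

B. Value Function-Based Penalty

The tools we have introduced in the last subsection, especially the Itô formula for the Stratonovich integral, enable us to develop the value function-based optimal penalty for controlled Markov diffusions. This penalty, denoted by $h_v^*(\mathbf{u},\mathbf{w})$, coincides with $\int_0^T V_x(t,x_t)^\top \sigma(t,x_t)\,dw_t$ when $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$.

Theorem 5 (Value Function-Based Penalty) Suppose the value function $V(t,x)$ for the problem (36)(37) (or (40)(37)) satisfies all the assumptions in Theorem 2(b). We also assume that the Itô formula for the Stratonovich integral (39) is valid with $F = V(t,x)$ and $y = (t,x_t)_{t \in [0,T]}$, where $(x_t)_{t \in [0,T]}$ is the solution to (40). Define

$$h_v^*(\mathbf{u},\mathbf{w}) \triangleq \sum_{i=1}^m \int_0^T \big[V_x(t,x_t)^\top \sigma^i(t,x_t)\big] \circ dw_t^i - \frac{1}{2}\int_0^T \Big[\mathrm{tr}\big(V_{xx}(t,x_t)(\sigma\sigma^\top)(t,x_t)\big) + V_x(t,x_t)^\top \sum_{i=1}^m \sigma_x^i\sigma^i(t,x_t)\Big]dt. \qquad (44)$$

Then

1) If $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$, (44) reduces to the following form

$$h_v^*(\mathbf{u},\mathbf{w}) = \int_0^T V_x(t,x_t)^\top \sigma(t,x_t)\,dw_t,$$

and $h_v^*(\mathbf{u},\mathbf{w}) \in \mathcal{M}_{\mathbb{F}}(0)$.

2) The strong duality holds in

$$V(0,x) = \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h_v^*(\mathbf{u},\mathbf{w})\Big\}\Big]. \qquad (45)$$

Moreover, the following equalities hold almost surely with $x_0 = x$:

$$V(0,x) = \sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h_v^*(\mathbf{u},\mathbf{w})\Big\} \qquad (46)$$

$$\phantom{V(0,x)} = \Lambda(x_T^*) + \int_0^T g(t,x_t^*,u_t^*)\,dt - h_v^*(\mathbf{u}^*,\mathbf{w}), \qquad (47)$$

where $(x_t^*)_{t \in [0,T]}$ is the solution of (36) using the optimal control $\mathbf{u}^* = (u_t^*)_{t \in [0,T]}$ (defined in Theorem 2(b)) on $[0,T)$ with the initial condition $x_0^* = x$.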



Proof: Suppose $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$ and let $y_t = V_x(t,x_t)^\top \sigma^i(t,x_t)$ in Remark 2 for $i = 1, \ldots, m$. We can immediately obtain

$$h_v^*(\mathbf{u},\mathbf{w}) = \sum_{i=1}^m \int_0^T V_x(t,x_t)^\top \sigma^i(t,x_t)\,dw_t^i = \int_0^T V_x(t,x_t)^\top \sigma(t,x_t)\,dw_t.$$

Note that $V_x$ and $\sigma$ both satisfy a polynomial growth condition, since $V(t,x) \in C^{1,2}(Q) \cap C_p(Q)$ (also see Appendix A). Then we have

$$\mathbb{E}_{0,x}\Big[\int_0^T \big\|V_x(t,x_t)^\top \sigma(t,x_t)\big\|^2\,dt\Big] < \infty,$$

and therefore $\mathbb{E}_{0,x}[h_v^*(\mathbf{u},\mathbf{w})] = 0$ when $\mathbf{u} \in \mathcal{U}_{\mathbb{F}}(0)$. Hence, $h_v^*(\mathbf{u},\mathbf{w}) \in \mathcal{M}_{\mathbb{F}}(0)$.

Then we show the strong duality (45). According to the weak duality (i.e., Proposition 1),

$$V(0,x) \le \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h_v^*(\mathbf{u},\mathbf{w})\Big\}\Big]. \qquad (48)$$

Next we prove the reverse inequality. Note that with $x_0 = x$,

$$\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h_v^*(\mathbf{u},\mathbf{w})$$
$$= V(0,x) + \int_0^T \big[g(t,x_t,u_t) + V_t(t,x_t) + V_x(t,x_t)^\top \bar{b}(t,x_t,u_t)\big]dt + \sum_{i=1}^m \int_0^T \big[V_x(t,x_t)^\top \sigma^i(t,x_t)\big] \circ dw_t^i - h_v^*(\mathbf{u},\mathbf{w})$$
$$= V(0,x) + \int_0^T \big[g(t,x_t,u_t) + A^{u_t}V(t,x_t)\big]dt,$$

where $\bar{b}$ is the drift of the S-SDE (40) and the first equality is obtained by applying the Itô formula for the Stratonovich integral (i.e., Proposition 5) to $V(t,x)$ with $V(T,x_T) = \Lambda(x_T)$:

$$V(T,x_T) = V(0,x_0) + \int_0^T \big[V_t(t,x_t) + V_x(t,x_t)^\top \bar{b}(t,x_t,u_t)\big]dt + \sum_{i=1}^m \int_0^T \big[V_x(t,x_t)^\top \sigma^i(t,x_t)\big] \circ dw_t^i.$$

Since we assume the value function satisfies all the assumptions in Theorem 2(b), there exists an optimal control $\mathbf{u}^* = (u_t^*)_{t \in [0,T]}$ with $u_t^* = u^*(t,x_t)$ and it satisfies

$$g(t,x,u^*(t,x)) + A^{u^*(t,x)}V(t,x) = \max_{u \in \mathcal{U}}\{g(t,x,u) + A^u V(t,x)\} = 0.$$

Then we have

$$\sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h_v^*(\mathbf{u},\mathbf{w})\Big\} = \sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{V(0,x) + \int_0^T \big[g(t,x_t,u_t) + A^{u_t}V(t,x_t)\big]dt\Big\}$$
$$\le V(0,x) + \int_0^T \sup_{u \in \mathcal{U}}\{g(t,x_t,u) + A^u V(t,x_t)\}\,dt \qquad (49)$$
$$= V(0,x) + \int_0^T \big[g(t,x_t^*,u_t^*) + A^{u_t^*}V(t,x_t^*)\big]dt$$
$$= V(0,x). \qquad (50)$$

Taking the conditional expectation on both sides, we have

$$V(0,x) \ge \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)}\Big\{\Lambda(x_T) + \int_0^T g(t,x_t,u_t)\,dt - h_v^*(\mathbf{u},\mathbf{w})\Big\}\Big].$$

Together with the weak duality (48), we reach the equality (45).

Because the equality (45) holds in expectation and the inequality (50) holds pathwise, the only inequality (49) (which makes (50) an inequality) must be an equality almost surely. So the equality (46) holds immediately in the almost sure sense. To achieve the equality in (49), the optimal control $\mathbf{u}^*$ should be applied, which implies the equality (47). ■

By imposing the value function-based optimal penalty, the objective value of the dual problem is equal to $V(0,x)$ not only in expectation, but also almost surely. Therefore, we can view the dual approach as a variance reduction technique. In particular, $h_v^*$ plays the role of a control variate. As another immediate fact, $h_v^*(\mathbf{u},\mathbf{w})$ evaluated at $\mathbf{u} = \mathbf{u}^*$ is equal to $h^*(\mathbf{u}^*,\mathbf{w})$ in Proposition 2.
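The control variate interpretation can be illustrated on a deliberately trivial case chosen by us because the value function is available in closed form: an uncontrolled diffusion $dx_t = \text{vol}\,dw_t$ (the control set is a single point) with terminal reward $x_T^2$, for which $V(t,x) = x^2 + \text{vol}^2(T-t)$ and $V_x = 2x$. The sketch below, with hypothetical parameter values, subtracts a discretized version of $\int_0^T V_x \sigma\,dw_t$ from the reward and compares the sample dispersion before and after.

```python
import math
import random
import statistics


def penalized_payoffs(x0, vol, horizon, n_steps, n_paths, seed=0):
    """Uncontrolled toy case: dx = vol * dw, terminal reward x_T ** 2.

    Here V(t, x) = x**2 + vol**2 * (horizon - t), so V_x = 2 * x and the
    penalty is the Ito integral of 2 * x_t * vol against dw_t.
    Returns the raw rewards and the rewards minus the penalty.
    """
    rng = random.Random(seed)
    dt = horizon / n_steps
    raw, penalized = [], []
    for _ in range(n_paths):
        x, penalty = x0, 0.0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            penalty += 2.0 * x * vol * dw   # V_x * sigma * dw, left endpoint
            x += vol * dw
        raw.append(x * x)
        penalized.append(x * x - penalty)
    return raw, penalized


raw, pen = penalized_payoffs(x0=1.0, vol=0.5, horizon=1.0,
                             n_steps=200, n_paths=500)
value = 1.0 ** 2 + 0.5 ** 2 * 1.0   # V(0, x0)
print(statistics.mean(raw), statistics.pstdev(raw))
print(statistics.mean(pen), statistics.pstdev(pen), value)
```

In continuous time the penalized reward equals $V(0,x_0)$ on every path; the small residual dispersion in the simulation comes only from the time discretization.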

C. Gradient-Based Penalty 

In this subsection we review the results in [20], where a Lagrangian term is proposed to penalize the relaxation 
of the requirement on non-anticipative control strategies. [20] characterizes the properties of this Lagrangian term, 
which coincides with the complementary slackness condition developed in Theorem 4 if it is regarded as a penalty 
function. We will show the "gradient-based" flavor of this Lagrangian term by comparing it with the gradient-based 
penalty proposed in [9] for MDPs. 

For simplicity we present the results of [20] in the case that the control set $\mathcal{U}$ is convex in $\mathbb{R}^{d_u}$ and the intermediate reward $g(t,x_t,u_t) = 0$ for $t \in [0,T]$. The following Lagrangian term $h_g^*$ (the subscript $g$ refers to "gradient-based") is used to penalize the relaxation of non-anticipative constraints on the control strategies:

$$h_g^*(\mathbf{u},\mathbf{w}) = \int_0^T \lambda(t,x_t,\mathbf{w})^\top u_t\,dt.$$

Then we consider the inner optimization problem with $h_g^*$ (parameterized by $\mathbf{w}$):

$$V_g(t,x,\mathbf{w}) = \sup_{\mathbf{u} \in \mathcal{U}(t)}\Big\{\Lambda(x_T) - \int_t^T \lambda(s,x_s,\mathbf{w})^\top u_s\,ds\Big\}, \quad x_t = x. \qquad (51)$$

Since $x_t = \xi_t(\eta_t)$ and only $\eta_t$ depends on $u_t$, we obtain an equivalent problem

$$\Theta(t,\eta,\mathbf{w}) = \sup_{\mathbf{u} \in \mathcal{U}(t)}\Big\{\Lambda \circ \xi_T(\eta_T) - \int_t^T \bar{\lambda}(s,\eta_s,\mathbf{w})^\top u_s\,ds\Big\}, \quad \eta_t = \eta = \xi_t^{-1}(x), \qquad (52)$$

where $\bar{\lambda}(s,\eta_s,\mathbf{w}) = \lambda(s,\xi_s(\eta_s),\mathbf{w})$.
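The change of variables $x_t = \xi_t(\eta_t)$ used above can be sketched for a scalar example in which everything is explicit. We assume, for illustration only, $\sigma(t,x) = S x$ and $b(t,x,u) = u x$, so that the flow of (42) is $\xi_t(x) = x e^{S w_t}$, its Jacobian is $e^{S w_t}$, and the Stratonovich drift is $\bar{b}(t,x,u) = (u - S^2/2)x$. The pathwise ODE (43) for $\eta_t$ is then solved with an Euler scheme for a control that peeks at the terminal value $w_T$, which would not be admissible in the Itô sense. All names and parameter values are hypothetical.

```python
import math
import random

S = 0.3   # hypothetical volatility: sigma(t, x) = S * x


def brownian_path(n_steps, horizon, seed):
    """Sampled Brownian path w_0, w_dt, ..., w_T as a list."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


def xi(k, x, w):      # stochastic flow of d xi = S * xi o dw with xi_0 = x
    return x * math.exp(S * w[k])


def dxi_dx(k, w):     # Jacobian of the flow (a scalar here)
    return math.exp(S * w[k])


def b_bar(x, u):      # Stratonovich drift of b(t, x, u) = u * x
    return (u - 0.5 * S ** 2) * x


def solve_state(x0, control, w, horizon):
    """Euler scheme for the pathwise ODE of eta, then x_T = xi_T(eta_T)."""
    n_steps = len(w) - 1
    dt = horizon / n_steps
    eta = x0
    for k in range(n_steps):
        u = control(k, w)   # the control may look at the whole path w
        eta += b_bar(xi(k, eta, w), u) / dxi_dx(k, w) * dt
    return xi(n_steps, eta, w)


def peeking_control(k, w):  # anticipative: depends on the terminal value w_T
    return 0.1 if w[-1] > 0.0 else 0.0


w = brownian_path(2000, 1.0, seed=3)
x_T = solve_state(1.0, peeking_control, w, 1.0)
u_used = peeking_control(0, w)
closed_form = math.exp((u_used - 0.5 * S ** 2) * 1.0 + S * w[-1])
print(x_T, closed_form)
```

No stochastic integral with respect to an anticipative integrand is ever computed: the Brownian path enters only through the flow, and the control enters only through an ordinary differential equation.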

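Suppose that $\mathbf{u}^* = (u^*(t,x_t))_{0 \le t \le T}$ is an optimal control for the problem (36)(37) (or (40)(37)). We will present one main result of [20] in Theorem 6 that characterizes $\lambda(t,x_t,\mathbf{w})$ such that $\mathbf{u}^*$ is also optimal for the problem (40)(51) a.s. (in the pathwise sense). For the sake of defining a proper $\lambda(t,x_t,\mathbf{w})$, [20] first introduced $\varphi_t(\eta)$, which is the flow of

$$\frac{d\varphi_t}{dt} = \Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\varphi_t)\Big[\bar{b}\big(t,\xi_t(\varphi_t),u^*(t,\xi_t(\varphi_t))\big) - \nabla_u \bar{b}\big(t,\xi_t(\varphi_t),u^*(t,\xi_t(\varphi_t))\big)\,u^*(t,\xi_t(\varphi_t))\Big] \qquad (53)$$

for $t \in [0,T]$ with the terminal condition $\varphi_T = \eta$, where $\nabla_u \bar{b}$ denotes the $n \times d_u$ Jacobian matrix of $\bar{b}$, the drift of the S-SDE (40), with respect to $u$ (it coincides with that of $b$, since $\bar{b} - b$ does not depend on $u$). We use $\varphi_t^{-1}$ to denote the inverse flow of $\varphi_t$.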

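Theorem 6 (Gradient-Based Penalty, Theorem 2.1 in [20]) Consider the deterministic optimal control problem (43)(52) with the terminal reward $\Theta(T,\eta,\mathbf{w}) = (\Lambda \circ \xi_T)(\eta)$ parameterized by $\mathbf{w}$, where $\xi_T(\eta)$ is the solution to (42) with $\xi_0 = \eta$. If we define

$$\bar{\lambda}(t,\eta,\mathbf{w})^\top \triangleq \frac{\partial\big[\Lambda \circ \xi_T(\varphi_t^{-1}(\eta))\big]}{\partial \eta}\Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta)\,\nabla_u \bar{b}\big(t,\xi_t(\eta),u^*(t,\xi_t(\eta))\big); \qquad (54)$$

$$\lambda(t,x,\mathbf{w}) = \bar{\lambda}(t,\xi_t^{-1}(x),\mathbf{w}),$$

then $u^*(t,\xi_t(\eta))$ is an optimal control for the problem (43)(52) a.s., and hence $u^*(t,x)$ is optimal for the problem (40)(51). We also have

$$\Theta(t,\eta,\mathbf{w}) = \Lambda \circ \xi_T\big(\varphi_t^{-1}(\eta)\big);$$
$$V_g(t,x,\mathbf{w}) = \Theta(t,\xi_t^{-1}(x),\mathbf{w});$$
$$\mathbb{E}_{t,x}[V_g(t,x,\mathbf{w})] = V(t,x).$$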

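Remark 3

1) $V_g(t,x,\mathbf{w})$ may not be equal to $V(t,x)$ almost surely in the pathwise sense.

2) Based on Theorem 6, we have

$$V(0,x) = \mathbb{E}_{0,x}[V_g(0,x,\mathbf{w})] = \mathbb{E}_{0,x}\Big[\sup_{\mathbf{u} \in \mathcal{U}(0)}\{\Lambda(x_T) - h_g^*(\mathbf{u},\mathbf{w})\}\Big] = \mathbb{E}_{0,x}\big[\Lambda(x_T^*) - h_g^*(\mathbf{u}^*,\mathbf{w})\big],$$

which implies $\mathbb{E}_{0,x}[h_g^*(\mathbf{u}^*,\mathbf{w})] = 0$. It can be seen that $(\mathbf{u}^*,h_g^*)$ satisfies the complementary slackness condition developed in Theorem 4. Therefore, $h_g^*(\mathbf{u},\mathbf{w})$ behaves exactly like an optimal penalty, though we have not shown that $h_g^*(\mathbf{u},\mathbf{w}) \in \mathcal{M}_{\mathbb{F}}(0)$.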

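A complete proof of Theorem 6 is in [20]. Here we provide some insight into the design of the Lagrangian multiplier $\bar{\lambda}(t,\eta,\mathbf{w})$ using the verification argument. If the value function $\Theta(t,\eta,\mathbf{w})$ for the problem (43)(52) is of sufficient regularity, it should satisfy the following HJB equation

$$\frac{\partial \Theta}{\partial t}(t,\eta,\mathbf{w}) + \sup_{u \in \mathcal{U}}\Big\{\frac{\partial \Theta}{\partial \eta}(t,\eta,\mathbf{w})\Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta)\,\bar{b}(t,\xi_t(\eta),u) - \bar{\lambda}(t,\eta,\mathbf{w})^\top u\Big\} = 0$$

with the terminal condition $\Theta(T,\eta,\mathbf{w}) = \Lambda \circ \xi_T(\eta)$, where $\bar{b}$ is the drift of the S-SDE (40). If we define

$$\bar{\lambda}(t,\eta,\mathbf{w})^\top = \frac{\partial \Theta}{\partial \eta}(t,\eta,\mathbf{w})\Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta)\,\nabla_u \bar{b}\big(t,\xi_t(\eta),u^*(t,\xi_t(\eta))\big)$$

as in (54), then the HJB equation becomes

$$\frac{\partial \Theta}{\partial t}(t,\eta,\mathbf{w}) + \sup_{u \in \mathcal{U}}\Big\{\frac{\partial \Theta}{\partial \eta}(t,\eta,\mathbf{w})\Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta)\Big[\bar{b}(t,\xi_t(\eta),u) - \nabla_u \bar{b}\big(t,\xi_t(\eta),u^*(t,\xi_t(\eta))\big)\,u\Big]\Big\} = 0.$$

We define the Hamiltonian $\mathcal{H}(t,\eta,u,\mathbf{w})$ with $\frac{\partial \Theta}{\partial \eta}$ playing the role of the costate:

$$\mathcal{H}(t,\eta,u,\mathbf{w}) = \frac{\partial \Theta}{\partial \eta}(t,\eta,\mathbf{w})\Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta)\Big[\bar{b}(t,\xi_t(\eta),u) - \nabla_u \bar{b}\big(t,\xi_t(\eta),u^*(t,\xi_t(\eta))\big)\,u\Big].$$

Because $\mathcal{H}(t,\eta,\cdot,\mathbf{w})$ is strictly concave on $\mathcal{U}$ (ensured by some technical conditions) and its gradient $\nabla_u \mathcal{H}(t,\eta,\cdot,\mathbf{w})$ is

$$\nabla_u \mathcal{H}(t,\eta,u,\mathbf{w}) = \frac{\partial \Theta}{\partial \eta}(t,\eta,\mathbf{w})\Big(\frac{\partial \xi_t}{\partial x}\Big)^{-1}(\eta)\Big[\nabla_u \bar{b}(t,\xi_t(\eta),u) - \nabla_u \bar{b}\big(t,\xi_t(\eta),u^*(t,\xi_t(\eta))\big)\Big], \qquad (55)$$

which vanishes at $u = u^*(t,\xi_t(\eta))$, it can be seen by the first-order condition that

$$\max_{u \in \mathcal{U}} \mathcal{H}(t,\eta,u,\mathbf{w}) = \mathcal{H}\big(t,\eta,u^*(t,\xi_t(\eta)),\mathbf{w}\big) \quad \text{a.s.}$$

Hence we have shown why $u^*(t,\xi_t(\eta))$ is optimal for the problem (43)(52). [20] used the method of characteristics to ensure the existence of a sufficiently regular function $\Theta(t,\eta,\mathbf{w})$ as a unique solution to the HJB equation, and a lengthy approximation argument is spent on passing the results in terms of $x$ via the transformation $x_t = \xi_t(\eta_t)$.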

The derivation of $h_g^*(\mathbf{u},\mathbf{w})$ for the continuous-time optimal control problem is based on maximizing the Hamiltonian using the convexity assumption on the control set $\mathcal{U}$ and the first-order condition, which is analogous to the derivation of the gradient-based penalty proposed in [9] for MDPs. The latter construction of the optimal penalty is more straightforward, as it only requires some basic knowledge of convex optimization. We briefly review their results in the setting of the introduction section. For simplicity we also assume $g_k(x_k,a_k) = 0$ and $\mathcal{A}_k$ is convex for $k = 0, \ldots, K-1$, and assume $\Lambda(x_K(\mathbf{a},\mathbf{v}))$ is differentiable and concave in the control strategy sequence $\mathbf{a}$ for every sequence $\mathbf{v}$. Consider the gradient-based penalty of the form

$$M_g^*(\mathbf{a},\mathbf{v}) = \nabla_{\mathbf{a}}\Lambda(x_K(\mathbf{a}^*,\mathbf{v}))^\top(\mathbf{a} - \mathbf{a}^*), \qquad (56)$$

where $\mathbf{a}^*$ is the optimal control, and $\nabla_{\mathbf{a}}\Lambda(x_K(\mathbf{a},\mathbf{v}))$ is the gradient of the terminal reward with respect to a feasible strategy $\mathbf{a} \in \mathbb{A}$; the first-order condition for optimizing (2) over control strategy sequences $\mathbf{a} \in \mathbb{A}_{\mathbb{G}}$ implies

$$\mathbb{E}_0[M_g^*(\mathbf{a},\mathbf{v})] \le 0, \quad \text{for all } \mathbf{a} \in \mathbb{A}_{\mathbb{G}}.$$

With the gradient-based penalty $M_g^*(\mathbf{a},\mathbf{v})$ the inner optimization problem (4) becomes

$$\sup_{\mathbf{a} \in \mathbb{A}}\big\{\Lambda(x_K(\mathbf{a},\mathbf{v})) - \nabla_{\mathbf{a}}\Lambda(x_K(\mathbf{a}^*,\mathbf{v}))^\top(\mathbf{a} - \mathbf{a}^*)\big\}. \qquad (57)$$

Note that the penalty $M_g^*(\mathbf{a},\mathbf{v})$ is linear in the feasible strategy $\mathbf{a}$, and hence the inner optimization problem is a convex optimization problem in $\mathbf{a}$. As a consequence, the first-order condition is sufficient to guarantee an optimal solution to (57): the gradient of the objective function in (57) is

$$\nabla_{\mathbf{a}}\Lambda(x_K(\mathbf{a},\mathbf{v})) - \nabla_{\mathbf{a}}\Lambda(x_K(\mathbf{a}^*,\mathbf{v})) \qquad (58)$$

and it becomes zero if we take $\mathbf{a} = \mathbf{a}^*$. Therefore, $\mathbf{a}^*$ is the optimal solution to the inner optimization (57). Hence, it is straightforward to see that $M_g^*(\mathbf{a},\mathbf{v})$ is an optimal penalty in the sense that $V_0^*(x) = (\mathcal{L}M_g^*)(x)$ for all $x \in \mathcal{X}$.

The analogy between using the first-order conditions (55) and (58) to derive optimal penalties for the continuous-time problem and the discrete-time problem, respectively, is now apparent. This is the reason why $h_g^*$ can be viewed as the "gradient-based penalty" in the dual formulation of controlled Markov diffusions.
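The discrete-time construction (56)-(58) can be tried on a one-period toy problem of our own making: the action $a$ must be chosen before a standard normal noise $v$ is observed, and the (hypothetical) terminal reward is $-(a - v)^2$, which is concave in $a$. The optimal non-anticipative action is then $a^* = 0$. The sketch below solves the inner problem (57) by grid search for each sampled $v$ and compares the resulting dual bound with the primal value; the grid, sample size and names are assumptions for illustration.

```python
import random
import statistics


def terminal_reward(a: float, v: float) -> float:
    """Hypothetical concave reward -(a - v)**2 in the action a."""
    return -(a - v) ** 2


def reward_gradient(a: float, v: float) -> float:
    """Derivative of terminal_reward with respect to the action a."""
    return -2.0 * (a - v)


A_STAR = 0.0   # optimal non-anticipative action when v has mean zero


def penalty(a: float, v: float) -> float:
    """Gradient-based penalty (56): gradient at a* times (a - a*)."""
    return reward_gradient(A_STAR, v) * (a - A_STAR)


def inner_problem(v: float, grid):
    """Maximize reward minus penalty over a grid of actions for fixed v."""
    best = max(grid, key=lambda a: terminal_reward(a, v) - penalty(a, v))
    return best, terminal_reward(best, v) - penalty(best, v)


rng = random.Random(0)
grid = [i / 10.0 for i in range(-30, 31)]
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
solutions = [inner_problem(v, grid) for v in noise]
dual_bound = statistics.mean(value for _, value in solutions)
primal_value = statistics.mean(terminal_reward(A_STAR, v) for v in noise)
no_penalty_bound = statistics.mean(
    max(terminal_reward(a, v) for a in grid) for v in noise)
print(primal_value, dual_bound, no_penalty_bound)
```

In this toy case the penalized inner problem is maximized at $a^*$ for every sample, so the dual bound coincides with the primal value on the same sample, whereas dropping the penalty gives the looser perfect-information bound. The penalized value still varies with $v$ from path to path, in line with the first point of Remark 3.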

Appendix E 
Details for Numerical Experiments 

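The dynamics of the market state and asset returns are the same as those considered in [38]. In particular, $\mu_k^\phi = -\lambda\phi_k$, $\mu_k = \mu_0 + \mu_1\phi_k$, $\sigma_k = \sigma$, $\sigma_k^{\phi,1} = \sigma^{\phi,1}$, and $\sigma_k^{\phi,2} = \sigma^{\phi,2}$. The parameter values, including $r_f$, $\lambda$, $\mu_0$, $\mu_1$, $\sigma$, $\sigma^{\phi,1}$, and $\sigma^{\phi,2}$, are listed in the following tables.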

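TABLE V
Parameter 1

log(R): $\mu_0 = (0.081,\ 0.110,\ 0.130)^\top$, $\mu_1 = (0.034,\ 0.059,\ 0.073)^\top$, $r_f = 0.01$,

$$\sigma = \begin{pmatrix} 0.186 & 0.000 & 0.000 \\ 0.228 & 0.083 & 0.000 \\ 0.251 & 0.139 & 0.069 \end{pmatrix}$$

$\lambda = 0.336$, $\sigma^{\phi,1} = (-0.741,\ -0.037,\ -0.060)$, $\sigma^{\phi,2} = 0.284$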

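TABLE VI
Parameter 2

log(R): $\mu_0 = (0.081,\ 0.110,\ 0.130)^\top$, $\mu_1 = (0.034,\ 0.059,\ 0.073)^\top$, $r_f = 0.01$,

$$\sigma = \begin{pmatrix} 0.186 & 0.000 & 0.000 \\ 0.228 & 0.083 & 0.000 \\ 0.251 & 0.139 & 0.069 \end{pmatrix}$$

$\lambda = 1.671$, $\sigma^{\phi,1} = (-0.017,\ 0.149,\ 0.058)$, $\sigma^{\phi,2} = 1.725$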

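TABLE VII
Parameter 3

log(R): $\mu_0 = (0.142,\ 0.109,\ 0.089)^\top$, $\mu_1 = (0.065,\ 0.049,\ 0.049)^\top$, $r_f = 0.01$,

$$\sigma = \begin{pmatrix} 0.256 & 0.000 & 0.000 \\ 0.217 & 0.054 & 0.000 \\ 0.207 & 0.062 & 0.062 \end{pmatrix}$$

$\lambda = 0.336$, $\sigma^{\phi,1} = (-0.741,\ -0.040,\ -0.034)$, $\sigma^{\phi,2} = 0.288$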

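TABLE VIII
Parameter 4

log(R): $\mu_0 = (0.142,\ 0.109,\ 0.089)^\top$, $\mu_1 = (0.061,\ 0.060,\ 0.067)^\top$, $r_f = 0.01$,

$$\sigma = \begin{pmatrix} 0.256 & 0.000 & 0.000 \\ 0.217 & 0.054 & 0.000 \\ 0.206 & 0.062 & 0.062 \end{pmatrix}$$

$\lambda = 1.671$, $\sigma^{\phi,1} = (-0.017,\ 0.212,\ 0.096)$, $\sigma^{\phi,2} = 1.716$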